path: root/block
Commit message | Author | Date | Files | Lines (-/+)
...
| * | | | block: Make request operation type argument declarations consistent  (Bart Van Assche, 2017-06-21, 2 files, -11/+12)

    Instead of declaring the second argument of blk_*_get_request() as int and passing it to functions that expect an unsigned int, declare that second argument as unsigned int. For consistency, also rename that second argument from 'rw' to 'op'. This patch does not change any functionality.

    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Hannes Reinecke <hare@suse.com>
    Cc: Omar Sandoval <osandov@fb.com>
    Cc: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
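    For illustration only (not part of the patch), a caller after this change passes the request operation as an unsigned int; the op constant and GFP flags below are assumptions, not taken from the commit:

        struct request *rq;

        /* second argument is now an unsigned int request op, not an int 'rw' */
        rq = blk_get_request(q, REQ_OP_SCSI_OUT, GFP_KERNEL);
        if (IS_ERR(rq))
                return PTR_ERR(rq);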
| * | | | blk-mq: Reduce blk_mq_hw_ctx size  (Bart Van Assche, 2017-06-21, 1 file, -8/+22)

    Since the srcu structure is rather large (184 bytes on an x86-64 system with kernel debugging disabled), only allocate it if needed.

    Reported-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Cc: Hannes Reinecke <hare@suse.com>
    Cc: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | block: return on congested block device  (Goldwyn Rodrigues, 2017-06-20, 2 files, -2/+24)

    A new bio operation flag, REQ_NOWAIT, is introduced to identify bios originating from an iocb with IOCB_NOWAIT. This flag indicates that the request should fail immediately if it cannot be made, instead of retrying.

    Stacked devices such as md (the ones with make_request_fn hooks) are currently not supported because they may block for housekeeping; for example, part of an md device can be suspended. For this reason, only request-based devices are supported. In the future, this feature will be extended to stacked devices by teaching them how to handle the REQ_NOWAIT flag.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
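    A minimal sketch of the intent (not the patch itself), assuming BLK_STS_AGAIN as the "would block" completion status and using a placeholder congestion check:

        /* In a submission path: honor REQ_NOWAIT by failing fast instead of sleeping. */
        if ((bio->bi_opf & REQ_NOWAIT) && example_queue_is_congested(q)) {
                bio->bi_status = BLK_STS_AGAIN;   /* assumed "would block" status */
                bio_endio(bio);                   /* complete immediately, no retry */
                return BLK_QC_T_NONE;
        }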
| * | | | Revert "blk-mq: don't use sync workqueue flushing from drivers"Ming Lei2017-06-181-22/+8Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch reverts commit 2719aa217e0d02(blk-mq: don't use sync workqueue flushing from drivers) because only blk_mq_quiesce_queue() need the sync flush, and now we don't need to stop queue any more, so revert it. Also changes to cancel_delayed_work() in blk_mq_stop_hw_queue(). Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: clarify dispatch may not be drained/blocked by stopping queue  (Ming Lei, 2017-06-18, 1 file, -0/+18)

    BLK_MQ_S_STOPPED may not be observed by other concurrent I/O paths, so we can't guarantee that dispatching won't happen after returning from the queue-stopping APIs. Clarify this fact to avoid potential misuse.

    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: don't stop queue for quiescing  (Ming Lei, 2017-06-18, 1 file, -6/+3)

    A queue can be started by other blk-mq APIs and used in different cases. This limits the uses of blk_mq_quiesce_queue() when it is based on stopping the queue, and makes it very difficult to use, since callers have to handle the stop-queue APIs carefully to avoid breaking blk_mq_quiesce_queue().

    We have applied the QUIESCED flag for draining and blocking dispatch, so stopping the queue is no longer necessary. With queue stopping removed, blk_mq_quiesce_queue() can be used safely and easily, and users no longer have to worry about the queue restarting during quiescing.

    Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: update comments on blk_mq_quiesce_queue()  (Ming Lei, 2017-06-18, 1 file, -3/+4)

    What we want from blk_mq_quiesce_queue() is not only to wait for completion of all ongoing .queue_rq() calls. In the typical context of canceling requests, we need to make sure that the following is done in the dispatch path before starting to cancel requests:

    - a failed dispatched request is finished
    - a busy dispatched request is requeued, and the STARTED flag is cleared

    So update the comment to keep code, documentation and our expectation consistent.

    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue  (Ming Lei, 2017-06-18, 2 files, -2/+12)

    It is required that no dispatch can happen once blk_mq_quiesce_queue() returns, and we have no such requirement on the queue-stopping APIs. But blk_mq_quiesce_queue() still may not block/drain dispatch in the BLK_MQ_S_START_ON_RUN case, so use the newly introduced QUEUE_FLAG_QUIESCED flag and evaluate it inside RCU read-side critical sections to fix this issue.

    Also, blk_mq_quiesce_queue() is currently implemented by stopping the queue, which limits its uses and is easy to race with, because any queue restart in other paths may break blk_mq_quiesce_queue(). With the introduced QUEUE_FLAG_QUIESCED flag, we no longer need to depend on stopping the queue for quiescing.

    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: introduce blk_mq_unquiesce_queue  (Ming Lei, 2017-06-18, 1 file, -0/+13)

    blk_mq_start_stopped_hw_queues() is used implicitly as the counterpart of blk_mq_quiesce_queue() for unquiescing a queue, so introduce blk_mq_unquiesce_queue() and make it the explicit counterpart of blk_mq_quiesce_queue(). This function is for improving the current quiescing mechanism in the following patches.

    Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
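    A sketch of the intended pairing after this series (the driver-side usage is an assumption, not taken from the patch):

        blk_mq_quiesce_queue(q);     /* no new ->queue_rq() once this returns */
        /* ... cancel requests or reconfigure the queue safely ... */
        blk_mq_unquiesce_queue(q);   /* dispatch may resume */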
| * | | | block: don't check for BIO_MAX_PAGES in blk_bio_segment_split()  (NeilBrown, 2017-06-18, 1 file, -16/+0)

    blk_bio_segment_split() makes sure bios have no more than BIO_MAX_PAGES entries in the bi_io_vec. This was done because bio_clone_bioset() (when given a mempool bioset) could not handle larger io_vecs.

    No driver uses bio_clone_bioset() any more; they all use bio_clone_fast() if anything, and bio_clone_fast() doesn't clone the bi_io_vec. The main user of bio_clone_bioset() at this level is bounce.c, and bouncing now happens before blk_bio_segment_split(), so that is not a concern. So remove the big helpful comment and the code.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | block: remove bio_clone() and all references.  (NeilBrown, 2017-06-18, 2 files, -4/+4)

    bio_clone() is no longer used; only bio_clone_bioset() and bio_clone_fast() are. This is for the best, as bio_clone() used fs_bio_set, and filesystems are unlikely to want to use bio_clone(). So remove bio_clone() and all references. This includes a fix to some incorrect documentation.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | block: Improvements to bounce-buffer handling  (NeilBrown, 2017-06-18, 2 files, -16/+30)

    Since commit 23688bf4f830 ("block: ensure to split after potentially bouncing a bio") blk_queue_bounce() is called *before* blk_queue_split(). This means that:

    1/ the comments in blk_queue_split() about bounce buffers are irrelevant, and
    2/ a very large bio (more than BIO_MAX_PAGES) will no longer be split before it arrives at blk_queue_bounce(), leading to the possibility that bio_clone_bioset() will fail and a NULL will be dereferenced.

    Separately, blk_queue_bounce() shouldn't use fs_bio_set, as the bio being copied could be from the same set, which could lead to a deadlock.

    So:
    - allocate two private biosets for blk_queue_bounce(), one for splitting enormous bios and one for cloning bios
    - add code to split a bio that exceeds BIO_MAX_PAGES
    - fix up the comments in blk_queue_split()

    Credit-to: Ming Lei <tom.leiming@gmail.com> (suggested using a single bio_for_each_segment loop)
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk: use non-rescuing bioset for q->bio_split.  (NeilBrown, 2017-06-18, 1 file, -2/+1)

    A rescuing bioset is only useful if there might be bios from that same bioset on the bio_list_on_stack queue at a time when bio_alloc_bioset() is called. This never applies to q->bio_split.

    Allocations from q->bio_split are only ever made from blk_queue_split(), which is only ever called early in each of the various make_request_fn()s. The original bio (call this A) is then passed to generic_make_request() and is placed on the bio_list_on_stack queue, and the bio that was allocated from q->bio_split (B) is processed. The processing of this may cause other bios to be passed to generic_make_request(), or may even cause bio B itself to be passed on, possibly after some prefix has been split off (using some other bioset).

    generic_make_request() now guarantees that all of these bios (B and dependants) will be fully processed before the tail of the original bio A gets handled. None of these early bios can possibly trigger an allocation from the original q->bio_split, as they are either too small to require splitting or (more likely) are destined for a different queue.

    The next time that the original q->bio_split might be used by this thread is when A is processed again, as it might still be too big to handle directly. By this time there cannot be any other bios allocated from q->bio_split in the generic_make_request() queue. So no rescuing will ever be needed.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk: make the bioset rescue_workqueue optional.  (NeilBrown, 2017-06-18, 2 files, -3/+13)

    This patch converts bioset_create() to not create a workqueue by default, so allocations will never trigger punt_bios_to_rescuer(). It also introduces a new flag, BIOSET_NEED_RESCUER, which tells bioset_create() to preserve the old behavior.

    All callers of bioset_create() that are inside block device drivers are given the BIOSET_NEED_RESCUER flag. biosets used by filesystems or other top-level users do not need rescuing, as their bios can never be queued behind other bios. This includes fs_bio_set, blkdev_dio_pool, btrfs_bioset, xfs_ioend_bioset, and the one allocated by target_core_iblock.c. biosets used by md/raid do not need rescuing, as their usage was recently audited and revised to never risk deadlock.

    It is hoped that most, if not all, of the remaining biosets can end up being the non-rescued version.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Credit-to: Ming Lei <ming.lei@redhat.com> (minor fixes)
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk: replace bioset_create_nobvec() with a flags arg to bioset_create()  (NeilBrown, 2017-06-18, 2 files, -39/+23)

    "flags" arguments are often seen as good API design, as they allow easy extensibility. bioset_create_nobvec() is implemented internally as a variation in the flags passed to __bioset_create(). To support future extension, make that internal structure part of the API: add a 'flags' argument to bioset_create() and discard bioset_create_nobvec().

    Note that the bio_split allocations in drivers/md/raid* do not need the bvec mempool; they should have used bioset_create_nobvec().

    Suggested-by: Christoph Hellwig <hch@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@infradead.org>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
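    Putting this entry and the BIOSET_NEED_RESCUER entry above together, a block-driver bioset allocation might look like the sketch below; the pool size and front pad are arbitrary example values, not taken from the patches:

        struct bio_set *bs;

        /* BIOSET_NEED_BVECS keeps the bvec mempool; BIOSET_NEED_RESCUER keeps
         * the old rescue-workqueue behavior needed by drivers whose bios can
         * end up queued behind other bios. */
        bs = bioset_create(BIO_POOL_SIZE, 0,
                           BIOSET_NEED_BVECS | BIOSET_NEED_RESCUER);
        if (!bs)
                return -ENOMEM;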
| * | | | blk: remove bio_set arg from blk_queue_split()  (NeilBrown, 2017-06-18, 3 files, -7/+6)

    blk_queue_split() is always called with the last arg being q->bio_split, where 'q' is the first arg. Also, blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses q->bio_split. This is inconsistent and unnecessary. Remove the last arg and always use q->bio_split inside blk_queue_split().

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Credit-to: Javier González <jg@lightnvm.io> (noticed that lightnvm was missed)
    Reviewed-by: Javier González <javier@cnexlabs.com>
    Tested-by: Javier González <javier@cnexlabs.com>
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
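    After this change the helper takes only the queue and the bio; the prototype below is inferred from the description, not copied from the patch:

        void blk_queue_split(struct request_queue *q, struct bio **bio);

        /* typical call early in a driver's make_request_fn */
        blk_queue_split(q, &bio);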
| * | | | blk-mq: remove __blk_mq_alloc_request  (Christoph Hellwig, 2017-06-18, 2 files, -47/+27)

    Move most code into blk_mq_rq_ctx_init, and the rest into blk_mq_get_request.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq-sched: unify request prepare methods  (Christoph Hellwig, 2017-06-18, 3 files, -35/+29)

    This patch makes sure we always allocate requests in the core blk-mq code and use a common prepare_request method to initialize them for both mq I/O schedulers. For Kyber, an additional limit_depth method is added that is called before allocating the request.

    Also, because none of the initializations can really fail, the new method does not return an error; instead the bfq finish method is hardened to deal with the no-IOC case.

    Last but not least, this removes the abuse of RQF_QUEUED by the blk-mq scheduling code, as RQF_ELVPRIV is all that is needed now.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: refactor blk_mq_sched_assign_ioc  (Christoph Hellwig, 2017-06-18, 3 files, -28/+17)

    blk_mq_sched_assign_ioc now only handles the assignment of the ioc if the scheduler needs it (bfq only at the moment). The call to the per-request initializer is moved out so that it can be merged with a similar call for the kyber I/O scheduler.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | bfq-iosched: fix NULL ioc check in bfq_get_rq_private  (Christoph Hellwig, 2017-06-18, 1 file, -10/+5)

    icq_to_bic is a container_of operation, so we need to check for NULL before it. Also move the check outside the spinlock while we're at it.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

| * | | | blk-mq: streamline blk_mq_get_request  (Christoph Hellwig, 2017-06-18, 1 file, -14/+14)

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

| * | | | blk-mq: simplify blk_mq_free_request  (Christoph Hellwig, 2017-06-18, 2 files, -38/+15)

    Merge three functions only tail-called by blk_mq_free_request into blk_mq_free_request.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

| * | | | blk-mq-sched: unify request finished methods  (Christoph Hellwig, 2017-06-18, 3 files, -15/+10)

    No need to have two different callouts of bfq vs kyber.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

| * | | | blk-mq: remove blk_mq_sched_{get,put}_rq_priv  (Christoph Hellwig, 2017-06-18, 3 files, -26/+8)

    Having these as separate helpers in a header really does not help readability, or my chances to refactor this code sanely.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

| * | | | blk-mq: move blk_mq_sched_{get,put}_request to blk-mq.c  (Christoph Hellwig, 2017-06-18, 3 files, -73/+67)

    Having them out of line in blk-mq-sched.c just makes the code flow unnecessarily complicated.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

| * | | | blk-mq: mark blk_mq_rq_ctx_init static  (Christoph Hellwig, 2017-06-18, 2 files, -5/+2)

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | Merge branch 'nvme-4.13' of git://git.infradead.org/nvme into for-4.13/block  (Jens Axboe, 2017-06-16, 2 files, -9/+7)

    Pull NVMe changes for 4.13 from Christoph. Highlights:

    - UUID identifier support from Johannes
    - Lots of cleanups from Sagi
    - Host Memory Buffer support from me

    And lots of cleanups and smaller fixes of course. Note that the UUID identifier changes are based on top of the uuid tree. I am the maintainer of that tree and will send it to Linus as soon as 4.12 is released, as various other trees depend on it as well (and the diffstat includes those changes unfortunately).
| | * \ \ \ Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into nvme-base  (Christoph Hellwig, 2017-06-13, 2 files, -9/+7)
| * | | | | block: Dedicated error code fixups  (Bart Van Assche, 2017-06-16, 2 files, -3/+3)

    This patch fixes two sparse warnings introduced by the "dedicated error codes for the block layer V3" patch series. These changes have not been tested.

    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | blk-mq: fixup type of 'ret' in __blk_mq_try_issue_directly()  (Jens Axboe, 2017-06-12, 1 file, -1/+1)

    Should be a blk_status_t type, not an integer.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | | Merge tag 'v4.12-rc5' into for-4.13/block  (Jens Axboe, 2017-06-12, 11 files, -57/+179)

    We've already got a few conflicts, and upcoming work depends on some of the changes that have gone into mainline as regression fixes for this series. Pull in 4.12-rc5 to resolve these conflicts and make it easier for downstream trees to continue working on 4.13 changes.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | | | block: switch bios to blk_status_t  (Christoph Hellwig, 2017-06-09, 6 files, -34/+40)

    Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep around at least for now and thus propagate to a proper blk_status_t value.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | blk-mq: switch ->queue_rq return value to blk_status_t  (Christoph Hellwig, 2017-06-09, 1 file, -20/+17)

    Use the same values for request completion errors as for the return value from ->queue_rq. BLK_STS_RESOURCE is special-cased to cause a requeue, and all the others are completed as-is.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
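    A sketch of a driver ->queue_rq() after the conversion; the hardware checks are hypothetical placeholders, only the return-value handling follows the description above:

        static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
                                             const struct blk_mq_queue_data *bd)
        {
                struct request *rq = bd->rq;

                if (!example_hw_has_room(hctx))         /* placeholder resource check */
                        return BLK_STS_RESOURCE;        /* blk-mq will requeue and retry */

                blk_mq_start_request(rq);

                if (example_hw_submit(rq) < 0)          /* placeholder submission */
                        return BLK_STS_IOERR;           /* request completed with this error */

                return BLK_STS_OK;                      /* submitted; completion comes later */
        }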
| * | | | | | block: introduce new block status code type  (Christoph Hellwig, 2017-06-09, 6 files, -80/+106)

    Currently we use normal Linux errno values in the block layer, and while we accept any error, a few have overloaded magic meanings. This patch instead introduces a new blk_status_t value that holds block layer specific status codes and explicitly explains their meaning.

    Helpers to convert from and to the previous special meanings are provided for now, but I suspect we want to get rid of them in the long run: those drivers that have an errno input (e.g. networking) usually get errnos that don't know about the special block layer overloads, and similarly returning them to userspace will usually return something that, strictly speaking, isn't correct for file system operations, but that's left as an exercise for later.

    For now the set of errors is a very limited set that closely corresponds to the previous overloaded errno values, but there is some low hanging fruit to improve it.

    blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse typechecking, so that we can easily catch places passing the wrong values.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
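    The conversion helpers mentioned above can be used roughly like this at the block layer boundaries (a sketch based on the description):

        blk_status_t status;
        int error;

        status = errno_to_blk_status(-EIO);     /* maps to BLK_STS_IOERR */
        error  = blk_status_to_errno(status);   /* maps back to -EIO */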
| * | | | | | bsg: Check queue type before attaching to a queue  (Bart Van Assche, 2017-06-01, 1 file, -0/+6)

    Since BSG only supports request queues for which struct scsi_request is the first member of their private request data, refuse to register block layer queues for which struct scsi_request is not the first member of their private data.

    References: commit bd1599d931ca ("scsi_transport_sas: fix BSG ioctl memory corruption")
    References: commit 82ed4db499b8 ("block: split scsi_request out of struct request")
    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Reviewed-by: Hannes Reinecke <hare@suse.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>

| * | | | | | block: Introduce queue flag QUEUE_FLAG_SCSI_PASSTHROUGH  (Bart Van Assche, 2017-06-01, 1 file, -0/+1)

    From the context where a SCSI command is submitted it is not always possible to figure out whether or not the queue the command is submitted to has struct scsi_request as the first member of its private data. Hence introduce the flag QUEUE_FLAG_SCSI_PASSTHROUGH.

    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Reviewed-by: Hannes Reinecke <hare@suse.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Cc: Omar Sandoval <osandov@fb.com>
    Cc: Don Brace <don.brace@microsemi.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>

| * | | | | | blk-mq-debugfs: Add 'kick' operation  (Bart Van Assche, 2017-06-01, 1 file, -1/+3)

    Running a queue causes the block layer to examine the per-CPU and hw queues but not the requeue list. Hence add a 'kick' operation that also examines the requeue list.

    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Eduardo Valentin <eduval@amazon.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Hannes Reinecke <hare@suse.com>
    Cc: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>

| * | | | | | blk-mq-debugfs: Show busy requests  (Bart Van Assche, 2017-06-01, 1 file, -0/+31)

    Requests that got stuck in a block driver are neither on blk_mq_ctx.rq_list nor on any hw dispatch queue. Make these visible in debugfs through the "busy" attribute.

    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Reviewed-by: Eduardo Valentin <eduval@amazon.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Hannes Reinecke <hare@suse.com>
    Cc: Omar Sandoval <osandov@fb.com>
    Cc: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>

| * | | | | | blk-mq-debugfs: Show requeue list  (Bart Van Assche, 2017-06-01, 1 file, -0/+32)

    When verifying whether or not a blk-mq driver forgot to kick the requeue list after having requeued a request, it is important to be able to verify the contents of the requeue list. Hence export that list through debugfs.

    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Reviewed-by: Hannes Reinecke <hare@suse.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Eduardo Valentin <eduval@amazon.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>

| * | | | | | blk-mq-debugfs: Show atomic request flags  (Bart Van Assche, 2017-06-01, 1 file, -0/+10)

    When analyzing e.g. queue lockups it is important to know whether or not a request has already been started. Hence also show the atomic request flags.

    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Reviewed-by: Hannes Reinecke <hare@suse.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Eduardo Valentin <eduval@amazon.com>
    Cc: Omar Sandoval <osandov@fb.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | cfq-iosched: Delete unused function min_vdisktime()  (Matthias Kaehlcke, 2017-05-30, 1 file, -9/+0)

    This fixes the following warning when building with clang:

        block/cfq-iosched.c:970:19: error: unused function 'min_vdisktime' [-Werror,-Wunused-function]

    Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
    Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | blk-mq: make per-sw-queue bio merge as default .bio_merge  (Ming Lei, 2017-05-26, 3 files, -72/+58)

    Because what the per-sw-queue bio merge does is basically the same as the scheduler's .bio_merge(), this patch makes the per-sw-queue bio merge the default .bio_merge if no scheduler is used or the I/O scheduler doesn't provide .bio_merge().

    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | blk-mq: merge bio into sw queue before plugging  (Ming Lei, 2017-05-26, 1 file, -22/+26)

    Before blk-mq was introduced, I/O was merged into the elevator before being put onto the plug queue, but blk-mq changed the order and makes merging into the sw queue basically impossible. It was then observed that the throughput of sequential I/O is degraded by about 10%~20% on virtio-blk in the test[1] if mq-deadline isn't used.

    This patch moves the per-sw-queue bio merging before plugging, like blk_queue_bio() does, and the performance regression is fixed in this situation.

    [1] test script:
        sudo fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=16 --ioengine=libaio --iodepth=64 --group_reporting=1 --filename=/dev/vdb --name=virtio_blk-test-$RW --rw=$RW --output-format=json
        RW=read or write

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | | | | Merge tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid  (Linus Torvalds, 2017-07-03, 2 files, -9/+7)

    Pull uuid subsystem from Christoph Hellwig:
    "This is the new uuid subsystem, in which Amir, Andy and I have started consolidating our uuid/guid helpers and improving the types used for them. Note that various other subsystems have pulled in this tree, so I'd like it to go in early.

    UUID/GUID summary:

    - introduce the new uuid_t/guid_t types that are going to replace the somewhat confusing uuid_be/uuid_le types and make the terminology fit the various specs, as well as the userspace libuuid library. (me, based on a previous version from Amir)
    - consolidated generic uuid/guid helper functions lifted from XFS and libnvdimm (Amir and me)
    - conversions to the new types and helpers (Amir, Andy and me)"

    * tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
      ACPI: hns_dsaf_acpi_dsm_guid can be static
      mmc: sdhci-pci: make guid intel_dsm_guid static
      uuid: Take const on input of uuid_is_null() and guid_is_null()
      thermal: int340x_thermal: fix compile after the UUID API switch
      thermal: int340x_thermal: Switch to use new generic UUID API
      acpi: always include uuid.h
      ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
      ACPI / extlog: Switch to use new generic UUID API
      ACPI / bus: Switch to use new generic UUID API
      ACPI / APEI: Switch to use new generic UUID API
      acpi, nfit: Switch to use new generic UUID API
      MAINTAINERS: add uuid entry
      tmpfs: generate random sb->s_uuid
      scsi_debug: switch to uuid_t
      nvme: switch to uuid_t
      sysctl: switch to use uuid_t
      partitions/ldm: switch to use uuid_t
      overlayfs: use uuid_t instead of uuid_be
      fs: switch ->s_uuid to uuid_t
      ima/policy: switch to use uuid_t
      ...
| * | | | | | partitions/ldm: switch to use uuid_t  (Christoph Hellwig, 2017-06-05, 2 files, -9/+7)

    And the uuid helpers.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Amir Goldstein <amir73il@gmail.com>
    Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
* | | | | | | block: provide bio_uninit() for freeing integrity/task associations  (Jens Axboe, 2017-06-28, 1 file, -3/+9)

    Wen reports significant memory leaks with DIF and O_DIRECT:

      "With nvme device + T10 enabled, on a system with 256GB we started logging /proc/meminfo & /proc/slabinfo every minute, and in an hour it increased by 15968128 kB or ~15+GB, approximately 256 MB / minute leaking.

      /proc/meminfo | grep SUnreclaim...
      SUnreclaim:  6752128 kB
      SUnreclaim:  6874880 kB
      SUnreclaim:  7238080 kB
      ....
      SUnreclaim: 22307264 kB
      SUnreclaim: 22485888 kB
      SUnreclaim: 22720256 kB

      When testcases with T10 enabled call into __blkdev_direct_IO_simple, the code doesn't free memory allocated by bio_integrity_alloc. The patch fixes the issue. HTX has been run +60 hours without failure."

    Since __blkdev_direct_IO_simple() allocates the bio on the stack, it doesn't go through the regular bio free. This means that any ancillary data allocated with the bio through the stack is not freed. Hence, we can leak the integrity data associated with the bio, if the device is using DIF/DIX.

    Fix this by providing a bio_uninit() and exporting it, so that we can use it to free this data. Note that this is a minimal fix for this issue. Any current user of bios that are allocated outside of bio_alloc_bioset() suffers from this issue, most notably some drivers. We will fix those in a more comprehensive patch for 4.13. This also means that the commit marked as being fixed by this isn't the real culprit, it's just the most obvious one out there.

    Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
    Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
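    A simplified sketch of the stack-bio pattern the fix targets (not the actual __blkdev_direct_IO_simple code):

        struct bio bio;
        struct bio_vec bvec;

        bio_init(&bio, &bvec, 1);   /* bio lives on the stack, never sees bio_put() */
        /* ... set up, submit and wait for the bio ... */
        bio_uninit(&bio);           /* frees integrity/task associations */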
* | | | | | blk-mq: fix performance regression with shared tags  (Jens Axboe, 2017-06-21, 3 files, -24/+59)

    If we have shared tags enabled, then every IO completion will trigger a full loop of every queue belonging to a tag set, and every hardware queue for each of those queues, even if nothing needs to be done. This causes a massive performance regression if you have a lot of shared devices.

    Instead of doing this huge full scan on every IO, add an atomic counter to the main queue that tracks how many hardware queues have been marked as needing a restart. With that, we can avoid looking for restartable queues, if we don't have to.

    Max reports that this restores performance. Before this patch, 4K IOPS was limited to 22-23K IOPS. With the patch, we are running at 950-970K IOPS.

    Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
    Reported-by: Max Gurtovoy <maxg@mellanox.com>
    Tested-by: Max Gurtovoy <maxg@mellanox.com>
    Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | block: Fix a blk_exit_rl() regression  (Bart Van Assche, 2017-06-14, 1 file, -12/+22)

    Avoid that the following complaint is reported:

      BUG: sleeping function called from invalid context at kernel/workqueue.c:2790
      in_atomic(): 1, irqs_disabled(): 0, pid: 41, name: rcuop/3
      1 lock held by rcuop/3/41:
       #0: (rcu_callback){......}, at: [<ffffffff8111f9a2>] rcu_nocb_kthread+0x282/0x500
      Call Trace:
       dump_stack+0x86/0xcf
       ___might_sleep+0x174/0x260
       __might_sleep+0x4a/0x80
       flush_work+0x7e/0x2e0
       __cancel_work_timer+0x143/0x1c0
       cancel_work_sync+0x10/0x20
       blk_throtl_exit+0x25/0x60
       blkcg_exit_queue+0x35/0x40
       blk_release_queue+0x42/0x130
       kobject_put+0xa9/0x190

    This happens since we invoke callbacks that need to block from the queue release handler. Fix this by pushing the final release to a workqueue.

    Reported-by: Ross Zwisler <zwisler@gmail.com>
    Fixes: commit b425e5049258 ("block: Avoid that blk_exit_rl() triggers a use-after-free")
    Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
    Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>

    Updated changelog
    Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | block, bfq: access and cache blkg data only when safe  (Paolo Valente, 2017-06-08, 3 files, -36/+105)

    In blk-cgroup, operations on blkg objects are protected with the request_queue lock. This is no longer the lock that protects I/O-scheduler operations in blk-mq: the latter are now protected with a finer-grained per-scheduler-instance lock. As a consequence, although blkg lookups are also rcu-protected, blk-mq I/O schedulers may see inconsistent data when they access blkg and blkg-related objects. BFQ does access these objects, and does incur this problem, in the following case.

    The blkg_lookup performed in bfq_get_queue, being protected (only) through rcu, may happen to return the address of a copy of the original blkg. If this is the case, then the blkg_get performed in bfq_get_queue, to pin down the blkg, is useless: it does not prevent blk-cgroup code from destroying both the original blkg and all objects directly or indirectly referred to by the copy of the blkg. BFQ accesses these objects, which typically causes a crash from a NULL-pointer dereference or a memory-protection violation.

    Some additional protection mechanism should be added to blk-cgroup to address this issue. In the meantime, this commit provides a quick temporary fix for BFQ: cache (when safe) blkg data that might disappear right after a blkg_lookup.

    In particular, this commit exploits the following facts to achieve its goal without introducing further locks. Destroy operations on a blkg invoke, as a first step, hooks of the scheduler associated with the blkg, and these hooks are executed with bfqd->lock held for BFQ. As a consequence, for any blkg associated with the request queue an instance of BFQ is attached to, we are guaranteed that such a blkg is not destroyed, and that all the pointers it contains are consistent, while that instance is holding its bfqd->lock. A blkg_lookup performed with bfqd->lock held then returns a fully consistent blkg, which remains consistent while this lock is held. In more detail, this holds even if the returned blkg is a copy of the original one.

    Finally, the object describing a group inside BFQ also needs to be protected from destruction on the blkg_free of the original blkg (which invokes bfq_pd_free). This commit adds private refcounting for this object, to let it disappear only after no bfq_queue refers to it any longer.

    This commit also removes or updates some stale comments on locking issues related to blk-cgroup operations.

    Reported-by: Tomas Konir <tomas.konir@gmail.com>
    Reported-by: Lee Tibbert <lee.tibbert@gmail.com>
    Reported-by: Marco Piazza <mpiazza@gmail.com>
    Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
    Tested-by: Tomas Konir <tomas.konir@gmail.com>
    Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
    Tested-by: Marco Piazza <mpiazza@gmail.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | blk-throttle: set default latency baseline for harddisk  (Shaohua Li, 2017-06-07, 1 file, -3/+17)

    Hard disk I/O latency varies a lot depending on spindle movement; the range can be from several microseconds to several milliseconds, so it's pretty hard to get the baseline latency used by io.low. We will use a different strategy here. The idea is to only use I/O involving spindle movement to determine whether a cgroup's I/O is in good state. For HD, if the I/O latency is small (< 1ms), we ignore the I/O: such I/O is likely sequential and doesn't help determine whether a cgroup's I/O is impacted by other cgroups. With this, we only account I/O with big latency, and we can then choose a hardcoded baseline latency for HD (4ms, a typical I/O latency with a seek). With all these settings, the io.low latency works for both HD and SSD.

    Signed-off-by: Shaohua Li <shli@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>