path: root/block/blk-mq.c
Commit message (Author, Date, Files changed, Lines -removed/+added)
* blk-mq: fix sysfs inflight counter (Omar Sandoval, 2018-04-26, 1 file, -0/+19)
  When the blk-mq inflight implementation was added, /proc/diskstats was converted to use it, but /sys/block/$dev/inflight was not. Fix it by adding another helper to count in-flight requests by data direction.
  Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
  Signed-off-by: Omar Sandoval <osandov@fb.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
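  A sketch of what such a per-direction counting helper can look like (the callback shape approximates that era's tag-busy iterator; not verbatim kernel code):

      struct mq_inflight {
              struct hd_struct *part;
              unsigned int *inflight;         /* inflight[READ], inflight[WRITE] */
      };

      static void blk_mq_check_inflight_rw(struct blk_mq_hw_ctx *hctx,
                                           struct request *rq, void *priv,
                                           bool reserved)
      {
              struct mq_inflight *mi = priv;

              /* count the request against its partition, split by direction */
              if (rq->part == mi->part)
                      mi->inflight[rq_data_dir(rq)]++;
      }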
* blk-mq: count allocated but not started requests in iostats inflight (Omar Sandoval, 2018-04-26, 1 file, -12/+9)
  In the legacy block case, we increment the counter right after we allocate the request, not when the driver handles it. In both the legacy and blk-mq cases, part_inc_in_flight() is called from blk_account_io_start() right after we've allocated the request. blk-mq only counted started requests as inflight, which is inconsistent with the legacy definition and with the intention of the code. This removes the started condition and instead counts all allocated requests.
  Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
  Signed-off-by: Omar Sandoval <osandov@fb.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Revert "blk-mq: remove code for dealing with remapping queue"Ming Lei2018-04-251-3/+31
| | | | | | | | | | | | | | | | | | This reverts commit 37c7c6c76d431dd7ef9c29d95f6052bd425f004c. Turns out some drivers(most are FC drivers) may not use managed IRQ affinity, and has their customized .map_queues meantime, so still keep this code for avoiding regression. Reported-by: Laurence Oberman <loberman@redhat.com> Tested-by: Laurence Oberman <loberman@redhat.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com> Cc: Ewan Milne <emilne@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: start request gstate with gen 1 (Jianchao Wang, 2018-04-17, 1 file, -0/+7)
  rq->gstate and rq->aborted_gstate are both zero before rqs are allocated. If we have a small timeout, then when the timer fires there could be rqs that were never allocated, and also rqs that have been allocated but not yet initialized and started. At that moment rq->gstate and rq->aborted_gstate are both 0, so blk_mq_terminate_expired() identifies the rq as timed out and invokes .timeout early. For SCSI, this causes scsi_times_out() to be invoked before the scsi_cmnd has been initialized; scsi_cmnd->device is still NULL at that point, and we crash.
  Cc: Bart Van Assche <bart.vanassche@wdc.com>
  Cc: Tejun Heo <tj@kernel.org>
  Cc: Ming Lei <ming.lei@redhat.com>
  Cc: Martin Steigerwald <Martin@Lichtvoll.de>
  Cc: stable@vger.kernel.org
  Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
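  A sketch of the fix as described (the helper name here is illustrative; upstream seeds this during request-pool initialization):

      static void blk_mq_seed_gstate(struct request *rq)
      {
              seqcount_init(&rq->gstate_seq);
              u64_stats_init(&rq->aborted_gstate_sync);
              /* start gstate with gen 1 instead of 0, so a never-started
                 request can no longer match aborted_gstate == 0 */
              WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
      }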
* blk-mq: remove code for dealing with remapping queue (Ming Lei, 2018-04-10, 1 file, -31/+3)
  Firstly, since commit 4b855ad37194 ("blk-mq: Create hctx for each present CPU"), blk-mq doesn't remap the queue any more after the CPU topology is changed. Secondly, set->nr_hw_queues can't be bigger than nr_cpu_ids, and now we map all possible CPUs to hw queues, so at least one CPU is mapped to each hctx. So queue mapping has become static and fixed, just like a percpu variable, and we don't need to handle queue remapping any more.
  Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
  Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue() (Ming Lei, 2018-04-10, 1 file, -3/+0)
  There are several reasons for removing the check:
  1) blk_mq_hw_queue_mapped() now always returns true, since each hctx is mapped by at least one CPU
  2) when there isn't any online CPU mapped to this hctx, no IO will be queued to it, and blk_mq_run_hw_queue() only runs the queue if there is IO queued to the hctx
  3) if __blk_mq_delay_run_hw_queue() is called from blk_mq_delay_run_hw_queue(), which runs from blk_mq_dispatch_rq_list() or scsi_mq_get_budget(), the hctx being handled must already be mapped.
  Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
  Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: remove blk_mq_delay_queue() (Ming Lei, 2018-04-10, 1 file, -28/+2)
  No driver uses this interface any more, so remove it.
  Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
  Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: introduce blk_mq_hw_queue_first_cpu() to figure out first cpu (Ming Lei, 2018-04-10, 1 file, -12/+11)
  This patch introduces the helper blk_mq_hw_queue_first_cpu() for figuring out a hctx's first CPU, so that code duplication can be avoided.
  Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
  Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
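  A plausible shape for the helper (illustrative, not the verbatim patch): prefer the first online CPU in the hctx mask, fall back to the first CPU in the mask overall.

      static int blk_mq_hw_queue_first_cpu(struct blk_mq_hw_ctx *hctx)
      {
              int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);

              if (cpu >= nr_cpu_ids)
                      cpu = cpumask_first(hctx->cpumask);
              return cpu;
      }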
* blk-mq: avoid to write intermediate result to hctx->next_cpu (Ming Lei, 2018-04-10, 1 file, -8/+9)
  This patch figures out the final selected CPU and only then writes it to hctx->next_cpu, once, so that no intermediate next_cpu value can be observed from other dispatch paths.
  Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
  Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: make sure that correct hctx->next_cpu is set (Ming Lei, 2018-04-10, 1 file, -0/+2)
  Since commit 20e4d81393196 ("blk-mq: simplify queue mapping & schedule with each possisble CPU"), one hctx can be mapped entirely to offline CPUs, and hctx->next_cpu can then be set to a wrong value. This patch fixes the issue by making hctx->next_cpu point to the first CPU in hctx->cpumask if all CPUs in hctx->cpumask are offline.
  Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
  Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
  Fixes: 20e4d81393196 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
  Cc: stable@vger.kernel.org
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: order getting budget and driver tag (Ming Lei, 2018-04-10, 1 file, -11/+10)
  This patch orders getting the budget and the driver tag by making sure the driver tag is only acquired after the budget, which helps to avoid the following race:
  1) before dispatching a request from the scheduler queue, get one budget first, then dequeue a request, call it request A.
  2) in another IO path, dispatching request B from hctx->dispatch, the driver tag is acquired, then blk_mq_dispatch_rq_list() tries to get the budget, but unfortunately the budget is held by request A.
  3) meanwhile blk_mq_dispatch_rq_list() is called to dispatch request A and tries to get a driver tag first, but no driver tag is available because it is held by request B.
  4) neither IO path can make progress, and IO stalls.
  This issue can be observed when running dbench on USB storage. This patch fixes it by always getting the budget before getting the driver tag.
  Cc: stable@vger.kernel.org
  Fixes: de1482974080ec9e ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Bart Van Assche <bart.vanassche@wdc.com>
  Cc: Omar Sandoval <osandov@fb.com>
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
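  The ordering rule, sketched (dequeue_next_request() is a hypothetical stand-in for the scheduler/dispatch-list dequeue, and the budget/tag helper signatures are simplified):

      while ((rq = dequeue_next_request(hctx)) != NULL) {
              if (!blk_mq_get_dispatch_budget(hctx))
                      break;                            /* no budget: stop, rq stays queued */
              if (!blk_mq_get_driver_tag(rq)) {
                      blk_mq_put_dispatch_budget(hctx); /* release budget on tag failure */
                      break;
              }
              /* ... hand rq to ->queue_rq() ... */
      }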
* block: Protect queue flag changes with the queue lock (Bart Van Assche, 2018-03-08, 1 file, -1/+1)
  Since the queue flags may be changed concurrently from multiple contexts after a queue becomes visible in sysfs, make these changes safe by protecting them with the queue lock.
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Hannes Reinecke <hare@suse.de>
  Cc: Ming Lei <ming.lei@redhat.com>
  Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
  Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Introduce blk_queue_flag_{set,clear,test_and_{set,clear}}() (Bart Van Assche, 2018-03-08, 1 file, -10/+2)
  Introduce functions that modify the queue flags and that protect these modifications with the request queue lock. Except for moving one wake_up_all() call from inside to outside a critical section, this patch does not change any functionality.
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Hannes Reinecke <hare@suse.de>
  Cc: Ming Lei <ming.lei@redhat.com>
  Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
  Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
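  A sketch of two of the helpers as described (close to what the commit text implies, but treat the bodies as illustrative):

      void blk_queue_flag_set(unsigned int flag, struct request_queue *q)
      {
              unsigned long flags;

              spin_lock_irqsave(q->queue_lock, flags);
              queue_flag_set(flag, q);
              spin_unlock_irqrestore(q->queue_lock, flags);
      }

      bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q)
      {
              unsigned long flags;
              bool res;

              spin_lock_irqsave(q->queue_lock, flags);
              res = queue_flag_test_and_set(flag, q);
              spin_unlock_irqrestore(q->queue_lock, flags);
              return res;
      }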
* block: Use the queue_flag_*() functions instead of open-coding these (Bart Van Assche, 2018-03-08, 1 file, -1/+1)
  Except for changing the atomic queue flag manipulations that are protected by the queue lock into non-atomic manipulations, this patch does not change any functionality.
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Hannes Reinecke <hare@suse.de>
  Cc: Ming Lei <ming.lei@redhat.com>
  Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
  Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Add 'lock' as third argument to blk_alloc_queue_node() (Bart Van Assche, 2018-02-28, 1 file, -1/+1)
  This patch does not change any functionality.
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Philipp Reisner <philipp.reisner@linbit.com>
  Cc: Ulf Hansson <ulf.hansson@linaro.org>
  Cc: Kees Cook <keescook@chromium.org>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: clear ctx pending bit under ctx lock (Omar Sandoval, 2018-02-28, 1 file, -1/+1)
  When we insert a request, we set the software queue pending bit while holding the software queue lock. However, we clear it outside of the lock, so it's possible that a concurrent insert could reset the bit after we clear it but before we empty the request list. Afterwards, the bit would still be set but the software queue wouldn't have any requests in it, leading us to do a spurious run in the future. This is mostly a benign/theoretical issue, but it makes the following change easier to justify.
  Signed-off-by: Omar Sandoval <osandov@fb.com>
  Acked-by: Tejun Heo <tj@kernel.org>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
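  The fixed pattern, sketched (list and bitmap field names approximate that era's blk-mq): both the splice and the bit clear now happen under ctx->lock, so an insert can't race between them.

      spin_lock(&ctx->lock);
      list_splice_tail_init(&ctx->rq_list, &rq_list);
      sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw);
      spin_unlock(&ctx->lock);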
* blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch (Ming Lei, 2018-02-24, 1 file, -1/+3)
  __blk_mq_requeue_request() covers two cases:
  - one is that the requeued request is added to hctx->dispatch, as in blk_mq_dispatch_rq_list()
  - another is that the request is requeued to the io scheduler, as in blk_mq_requeue_request().
  We should call the io scheduler's .requeue_request callback only in the second case.
  Cc: Paolo Valente <paolo.valente@linaro.org>
  Cc: Omar Sandoval <osandov@fb.com>
  Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
  Cc: stable@vger.kernel.org
  Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
  Acked-by: Paolo Valente <paolo.valente@linaro.org>
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk: optimization for classic polling (Nitesh Shetty, 2018-02-13, 1 file, -0/+1)
  This removes the dependency on interrupts to wake up the task. Set the task state to TASK_RUNNING if need_resched() returns true while polling for IO completion. Earlier, the polling task used to sleep, relying on an interrupt to wake it up. This made some IO take very long when interrupt coalescing is enabled in NVMe.
  Reference: http://lists.infradead.org/pipermail/linux-nvme/2018-February/015435.html
  Changes since v2->v3: using __set_current_state() instead of set_current_state()
  Changes since v1->v2: setting task state once in blk_poll, instead of in multiple callers
  Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
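  The loop shape this describes, sketched (poll_once() is a hypothetical stand-in for the driver's poll hook; not the verbatim blk_poll() code):

      while (!poll_once(q, cookie)) {
              if (need_resched()) {
                      /* stay runnable instead of sleeping for the IRQ */
                      __set_current_state(TASK_RUNNING);
                      break;
              }
              cpu_relax();
      }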
* blk-mq: introduce BLK_STS_DEV_RESOURCE (Ming Lei, 2018-01-31, 1 file, -4/+16)
  This status is returned from a driver to the block layer if a device-related resource is unavailable but the driver can guarantee that IO dispatch will be triggered in the future when the resource becomes available. Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if a driver returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the queue after a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is 3 ms because both scsi-mq and nvmefc are using that magic value.
  If a driver can make sure there is in-flight IO, it is safe to return BLK_STS_DEV_RESOURCE because:
  1) if all in-flight IO completes before SCHED_RESTART is examined in blk_mq_dispatch_rq_list(), SCHED_RESTART must have been cleared, so the queue is run immediately by blk_mq_dispatch_rq_list();
  2) if there is any in-flight IO when/after SCHED_RESTART is examined in blk_mq_dispatch_rq_list():
  - if SCHED_RESTART isn't set, the queue is run immediately as handled in 1)
  - otherwise, this request will be dispatched after any in-flight IO completes, via blk_mq_sched_restart()
  3) if SCHED_RESTART is set concurrently because of BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() covers the above two cases and makes sure an IO hang is avoided.
  One invariant is that the queue will be rerun if SCHED_RESTART is set.
  Suggested-by: Jens Axboe <axboe@kernel.dk>
  Tested-by: Laurence Oberman <loberman@redhat.com>
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
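  A sketch of a driver's ->queue_rq() using the new status per the semantics above (all my_* names are hypothetical):

      static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                      const struct blk_mq_queue_data *bd)
      {
              if (!my_reserve_device_resource(bd->rq)) {
                      if (my_device_has_inflight_io())
                              /* completion of in-flight IO will rerun the queue */
                              return BLK_STS_DEV_RESOURCE;
                      /* nothing in flight: let blk-mq rerun after a delay */
                      return BLK_STS_RESOURCE;
              }
              my_submit(bd->rq);
              return BLK_STS_OK;
      }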
* Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block (Linus Torvalds, 2018-01-29, 1 file, -230/+437)
|\
  Pull block updates from Jens Axboe:
  "This is the main pull request for block IO related changes for the 4.16 kernel. Nothing major in this pull request, but a good amount of improvements and fixes all over the map. This contains:
  - BFQ improvements, fixes, and cleanups from Angelo, Chiara, and Paolo.
  - Support for SMR zones for deadline and mq-deadline from Damien and Christoph.
  - Set of fixes for bcache by way of Michael Lyle, including fixes from himself, Kent, Rui, Tang, and Coly.
  - Series from Matias for lightnvm with fixes from Hans Holmberg, Javier, and Matias. Mostly centered around pblk, and removing rrpc 1.2 in preparation for supporting 2.0.
  - A couple of NVMe pull requests from Christoph. Nothing major in here, just fixes and cleanups, and support for command tracing from Johannes.
  - Support in blk-throttle for tracking reads and writes separately. From Joseph Qi. A few cleanups/fixes also for blk-throttle from Weiping.
  - Series from Mike Snitzer that enables dm to register its queue more logically, something that's always been problematic on dm since it's a stacked device.
  - Series from Ming cleaning up some of the bio accessor use, in preparation for supporting multipage bvecs.
  - Various fixes from Ming closing up holes around queue mapping and quiescing.
  - BSD partition fix from Richard Narron, fixing a problem where we can't mount newer (10/11) FreeBSD partitions.
  - Series from Tejun reworking blk-mq timeout handling. The previous scheme relied on atomic bits, but it had races where we would think a request had timed out if it got reused at the wrong time.
  - null_blk now supports faking timeouts, to enable us to better exercise and test that functionality separately. From me.
  - Kill the separate atomic poll bit in the request struct. After this, we don't use the atomic bits on blk-mq anymore at all. From me.
  - sgl_alloc/free helpers from Bart.
  - Heavily contended tag case scalability improvement from me.
  - Various little fixes and cleanups from Arnd, Bart, Corentin, Douglas, Eryu, Goldwyn, and myself"
  * 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
    block: remove smart1,2.h
    nvme: add tracepoint for nvme_complete_rq
    nvme: add tracepoint for nvme_setup_cmd
    nvme-pci: introduce RECONNECTING state to mark initializing procedure
    nvme-rdma: remove redundant boolean for inline_data
    nvme: don't free uuid pointer before printing it
    nvme-pci: Suspend queues after deleting them
    bsg: use pr_debug instead of hand crafted macros
    blk-mq-debugfs: don't allow write on attributes with seq_operations set
    nvme-pci: Fix queue double allocations
    block: Set BIO_TRACE_COMPLETION on new bio during split
    blk-throttle: use queue_is_rq_based
    block: Remove kblockd_schedule_delayed_work{,_on}()
    blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
    blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
    lib/scatterlist: Fix chaining support in sgl_alloc_order()
    blk-throttle: track read and write request individually
    block: add bdev_read_only() checks to common helpers
    block: fail op_is_write() requests to read-only partitions
    blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
    ...
| * blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays (Bart Van Assche, 2018-01-19, 1 file, -4/+3)
  Make sure that calling blk_mq_run_hw_queue() or blk_mq_kick_requeue_list() triggers a queue run without delay even if blk_mq_delay_run_hw_queue() has been called recently and if its delay has not yet expired.
  Reviewed-by: Mike Snitzer <snitzer@redhat.com>
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
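  Judging by the "block: Remove kblockd_schedule_delayed_work{,_on}()" entry in the merge shortlog above, this presumably switches to a mod_delayed_work()-style helper, which lets a delay-0 run replace a pending delayed run instead of being skipped; a sketch (call name assumed):

      kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
                                  msecs_to_jiffies(msecs));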
| * blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly() (Bart Van Assche, 2018-01-19, 1 file, -2/+2)
  Most blk-mq functions have a name that follows the pattern blk_mq_${action}. However, the function name blk_mq_request_direct_issue is an exception. Hence rename this function. This patch does not change any functionality.
  Reviewed-by: Mike Snitzer <snitzer@redhat.com>
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy (Ming Lei, 2018-01-18, 1 file, -12/+10)
  If we run into blk_mq_request_direct_issue() when the queue is busy, we don't want to dispatch this request into hctx->dispatch_list. What we need to do is return the queue-busy info to the caller, so that the caller can deal with it well.
  Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
  Reported-by: Laurence Oberman <loberman@redhat.com>
  Reviewed-by: Mike Snitzer <snitzer@redhat.com>
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request (Mike Snitzer, 2018-01-17, 1 file, -9/+7)
  After commit 923218f6166a ("blk-mq: don't allocate driver tag upfront for flush rq") we no longer use the 'can_block' argument in blk_mq_sched_insert_request(). Kill it.
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  Added actual commit message as to why it's being removed.
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback (Ming Lei, 2018-01-17, 1 file, -8/+29)
  blk_insert_cloned_request() is called in the fast path of a dm-rq driver (e.g. blk-mq request-based DM mpath). blk_insert_cloned_request() uses blk_mq_request_bypass_insert() to directly append the request to the blk-mq hctx->dispatch_list of the underlying queue.
  1) This way isn't efficient enough because the hctx spinlock is always used.
  2) With blk_insert_cloned_request(), we completely bypass the underlying queue's elevator and depend on the upper-level dm-rq driver's elevator to schedule IO. But dm-rq currently can't get the underlying queue's dispatch feedback at all. Without knowing whether a request was issued or not (e.g. due to the underlying queue being busy) the dm-rq elevator is not able to provide effective IO merging (as a side effect, dm-rq currently blindly destages a request from its elevator only to requeue it after a delay, which kills any opportunity for merging). This obviously causes very bad sequential IO performance.
  Fix this by updating blk_insert_cloned_request() to use blk_mq_request_direct_issue(). blk_mq_request_direct_issue() allows a request to be issued directly to the underlying queue and returns the dispatch feedback (blk_status_t). If blk_mq_request_direct_issue() returns BLK_STS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE to _not_ destage the request, thereby preserving the opportunity to merge IO.
  With this, request-based DM's blk-mq sequential IO performance is vastly improved (as much as 3X in mpath/virtio-scsi testing).
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but they were refactored to make them less fragile and easier to read/review]
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
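  A sketch of the dm-rq consumer side this describes (BLK_STS_RESOURCE and DM_MAPIO_REQUEUE come from the text; dm_complete_request() and the exact control flow are assumptions):

      blk_status_t ret = blk_insert_cloned_request(clone->q, clone);

      switch (ret) {
      case BLK_STS_OK:
              break;
      case BLK_STS_RESOURCE:
              /* underlying queue busy: don't destage, keep rq mergeable */
              return DM_MAPIO_REQUEUE;
      default:
              /* pass a real error through to the original request */
              dm_complete_request(rq, ret);
      }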
| * blk-mq: factor out a few helpers from __blk_mq_try_issue_directly (Mike Snitzer, 2018-01-17, 1 file, -27/+52)
  No functional change. Just makes the code flow more logically. In a following commit, __blk_mq_try_issue_directly() will be used to return the dispatch result (blk_status_t) to DM. DM needs this information to improve IO merging.
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: turn WARN_ON in __blk_mq_run_hw_queue into printk (Ming Lei, 2018-01-17, 1 file, -2/+20)
  We know this WARN_ON is harmless, and in reality it may be triggered, so convert it to printk() and dump_stack() to avoid confusing people. Also add a comment about the two related races here.
  Cc: Christian Borntraeger <borntraeger@de.ibm.com>
  Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Thomas Gleixner <tglx@linutronix.de>
  Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: make sure hctx->next_cpu is set correctly (Ming Lei, 2018-01-17, 1 file, -2/+28)
  When hctx->next_cpu is set from possible online CPUs, there is a race in which hctx->next_cpu may be set as >= nr_cpu_ids, which finally breaks the workqueue. The race can be triggered in the following two situations:
  1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called to dispatch requests from the DEAD cpu context, but at that time the DEAD CPU has already been cleared from 'cpu_online_mask', so all CPUs in hctx->cpumask may appear offline, causing hctx->next_cpu to be set to a bad value.
  2) blk_mq_delay_run_hw_queue() is called from CPU B and finds the queue should be run on the other CPU A; CPU A may then go offline at the same time, so that all CPUs in hctx->cpumask become offline.
  This patch deals with the issue by re-selecting the next CPU, making sure it is set correctly.
  Cc: Christian Borntraeger <borntraeger@de.ibm.com>
  Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Thomas Gleixner <tglx@linutronix.de>
  Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
  Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
  Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
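  A sketch of the re-selection logic (the batch counter and the fallback order approximate the patch as described; not verbatim):

      static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
      {
              int next_cpu = hctx->next_cpu;

              if (--hctx->next_cpu_batch <= 0) {
                      next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,
                                                  cpu_online_mask);
                      if (next_cpu >= nr_cpu_ids)
                              next_cpu = cpumask_first_and(hctx->cpumask,
                                                           cpu_online_mask);
                      /* whole mask offline: pick any mask CPU rather than
                         hand an out-of-range index to the workqueue code */
                      if (next_cpu >= nr_cpu_ids)
                              next_cpu = cpumask_first(hctx->cpumask);
                      hctx->next_cpu = next_cpu;
                      hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
              }
              return next_cpu;
      }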
| * blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init() (Jens Axboe, 2018-01-14, 1 file, -2/+3)
  A previous commit moved the clearing of rq->rq_flags later, but we may have already set RQF_MQ_INFLIGHT when that happens. Ensure that we correctly initialize rq->rq_flags to the right value. This is based on an original fix by Ming, just rewritten to not require a conditional.
  Fixes: 7c3fb70f0341 ("block: rearrange a few request fields for better cache layout")
  Reviewed-by: Mike Snitzer <snitzer@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: simplify queue mapping & schedule with each possisble CPU (Christoph Hellwig, 2018-01-12, 1 file, -11/+8)
  The previous patch assigns interrupt vectors to all possible CPUs, so now a hctx can be mapped to possible CPUs. This patch applies that fact to simplify queue mapping & scheduling so that we don't need to handle CPU hotplug for dealing with physical CPU plug & unplug. With this simplification, we can work well on physical CPU plug & unplug, which is a normal use case for a VM at least. Make sure we allocate blk_mq_ctx structures for all possible CPUs, and set hctx->numa_node for possible CPUs which are mapped to this hctx. And only choose the online CPUs for scheduling.
  Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
  Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
  Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
  Cc: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
  (merged the three into one because any single one may not work, and fix selecting online CPUs for scheduler)
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait() (Bart Van Assche, 2018-01-11, 1 file, -34/+35)
  This patch does not change any functionality but makes the blk_mq_mark_tag_wait() code slightly easier to read.
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Omar Sandoval <osandov@fb.com>
  Cc: Hannes Reinecke <hare@suse.de>
  Cc: Johannes Thumshirn <jthumshirn@suse.de>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: Add locking annotations to hctx_lock() and hctx_unlock() (Bart Van Assche, 2018-01-10, 1 file, -0/+2)
  This patch avoids that sparse reports the following:
  block/blk-mq.c:637:33: warning: context imbalance in 'hctx_unlock' - unexpected unlock
  block/blk-mq.c:642:9: warning: context imbalance in 'hctx_lock' - wrong count at exit
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Cc: Tejun Heo <tj@kernel.org>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
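  The annotations in question, sketched on hctx_unlock() (sparse-only; they compile away otherwise, and the body is approximate):

      static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
              __releases(hctx->srcu)
      {
              if (!(hctx->flags & BLK_MQ_F_BLOCKING))
                      rcu_read_unlock();
              else
                      srcu_read_unlock(hctx->srcu, srcu_idx);
      }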
| * block: rearrange a few request fields for better cache layout (Jens Axboe, 2018-01-10, 1 file, -9/+9)
  Move completion related items (like the call single data) near the end of the struct, instead of mixing them in with the initial queueing related fields. Move queuelist below the bio structures. Then we have all queueing related bits in the first cache line. This yields a 1.5-2% increase in IOPS for a null_blk test, both for sync and for high thread count access. The sync test goes from 975K to 992K, the 32-thread case from 20.8M to 21.2M IOPS.
  Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
  Reviewed-by: Omar Sandoval <osandov@fb.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit (Jens Axboe, 2018-01-10, 1 file, -1/+1)
  We only have one atomic flag left. Instead of using an entire unsigned long for that, steal the bottom bit of the deadline field that we already reserved. Remove ->atomic_flags, since it's now unused.
  Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
  Reviewed-by: Omar Sandoval <osandov@fb.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
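  A sketch of the resulting accessors (bit 0 as the stolen "complete" flag; the mask value is illustrative):

      static inline void blk_rq_set_deadline(struct request *rq, unsigned long time)
      {
              rq->__deadline = time & ~0x1UL;   /* bit 0 carries the complete flag */
      }

      static inline unsigned long blk_rq_deadline(struct request *rq)
      {
              return rq->__deadline & ~0x1UL;
      }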
| * block: add accessors for setting/querying request deadline (Jens Axboe, 2018-01-10, 1 file, -1/+1)
  We reduce the resolution of request expiry, but since we're already using jiffies for this, where resolution depends on the kernel configuration, and since the timeout resolution is coarse anyway, that should be fine.
  Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
  Reviewed-by: Omar Sandoval <osandov@fb.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: remove REQ_ATOM_POLL_SLEPT (Jens Axboe, 2018-01-10, 1 file, -3/+2)
  We don't need this to be an atomic flag, it can be a regular flag. We either end up on the same CPU for the polling, in which case the state is sane, or we did the sleep, which would imply the needed barrier to ensure we see the right state.
  Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
  Reviewed-by: Omar Sandoval <osandov@fb.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: Explain when 'active_queues' is decremented (Bart Van Assche, 2018-01-10, 1 file, -0/+6)
  It is nontrivial to derive from the blk-mq source code when blk_mq_tags.active_queues is decremented. Hence add a comment that explains this.
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Hannes Reinecke <hare@suse.de>
  Cc: Johannes Thumshirn <jthumshirn@suse.de>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: Fix spelling in a source code comment (Bart Van Assche, 2018-01-09, 1 file, -2/+2)
  Change "nedeing" into "needing" and "caes" into "cases".
  Fixes: commit f906a6a0f426 ("blk-mq: improve tag waiting setup for non-shared tags")
  Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Omar Sandoval <osandov@fb.com>
  Cc: Hannes Reinecke <hare@suse.de>
  Cc: Johannes Thumshirn <jthumshirn@suse.de>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: silence false positive warnings in hctx_unlock() (Jens Axboe, 2018-01-09, 1 file, -2/+4)
  In some stupider versions of gcc, it complains:
  block/blk-mq.c: In function 'blk_mq_complete_request':
  ./include/linux/srcu.h:175:2: warning: 'srcu_idx' may be used uninitialized in this function [-Wmaybe-uninitialized]  __srcu_read_unlock(sp, idx);
  block/blk-mq.c:620:6: note: 'srcu_idx' was declared here  int srcu_idx;
  which is completely bogus, since we only use srcu_idx when hctx->flags & BLK_MQ_F_BLOCKING is set, and that's the case where hctx_lock() has initialized it. Just set it to '0' in the normal path in hctx_lock() to silence this annoying warning.
  Fixes: 04ced159cec8 ("blk-mq: move hctx lock/unlock into a helper")
  Fixes: 5197c05e16b4 ("blk-mq: protect completion path with RCU")
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
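  The silenced path, sketched (this mirrors the fix as described; the comment wording is illustrative):

      static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
              __acquires(hctx->srcu)
      {
              if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
                      /* shut up the gcc false positive */
                      *srcu_idx = 0;
                      rcu_read_lock();
              } else {
                      *srcu_idx = srcu_read_lock(hctx->srcu);
              }
      }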
| * blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu (Tejun Heo, 2018-01-09, 1 file, -7/+7)
  The RCU protection has been expanded to cover both queueing and completion paths, making ->queue_rq_srcu a misnomer. Rename it to ->srcu as suggested by Bart.
  Signed-off-by: Tejun Heo <tj@kernel.org>
  Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: remove REQ_ATOM_STARTED (Tejun Heo, 2018-01-09, 1 file, -29/+8)
  After the recent updates to use generation number and state based synchronization, we can easily replace REQ_ATOM_STARTED usages by adding an extra state to distinguish completed but not yet freed state. Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with blk_mq_rq_state() tests. REQ_ATOM_STARTED no longer has any users left and is removed.
  Signed-off-by: Tejun Heo <tj@kernel.org>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq (Tejun Heo, 2018-01-09, 1 file, -8/+7)
  After the recent updates to use generation number and state based synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except to avoid firing the same timeout multiple times. Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple times. This removes atomic bitops from hot paths too.
  v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().
  v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.
  Signed-off-by: Tejun Heo <tj@kernel.org>
  Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: make blk_abort_request() trigger timeout path (Tejun Heo, 2018-01-09, 1 file, -1/+1)
  With issue/complete and timeout paths now using the generation number and state based synchronization, blk_abort_request() is the only one which depends on REQ_ATOM_COMPLETE for arbitrating completion. There's no reason for blk_abort_request() to be a completely separate path. This patch makes blk_abort_request() piggyback on the timeout path instead of trying to terminate the request directly. This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.
  Note that this makes blk_abort_request() asynchronous - it initiates abortion but the actual termination will happen after a short while, even when the caller owns the request. AFAICS, SCSI and ATA should be fine with that, and I think mtip32xx and dasd should be safe but am not completely sure. It'd be great if people who know the drivers take a look.
  v2: Add comment explaining the lack of synchronization around ->deadline update as requested by Bart.
  Signed-off-by: Tejun Heo <tj@kernel.org>
  Cc: Asai Thambi SP <asamymuthupa@micron.com>
  Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
  Cc: Jan Hoeppner <hoeppner@linux.vnet.ibm.com>
  Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE (Tejun Heo, 2018-01-09, 1 file, -3/+3)
  blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test REQ_ATOM_COMPLETE to determine the request state. Both uses are speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for equivalent results. Replace the tests. This will allow removing REQ_ATOM_COMPLETE usages from blk-mq.
  Signed-off-by: Tejun Heo <tj@kernel.org>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: replace timeout synchronization with a RCU and generation based scheme (Tejun Heo, 2018-01-09, 1 file, -72/+157)
  Currently, the blk-mq timeout path synchronizes against the usual issue/completion path using a complex scheme involving atomic bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence rules. Unfortunately, it contains quite a few holes. There's a complex dance around REQ_ATOM_STARTED and REQ_ATOM_COMPLETE between the issue/completion and timeout paths; however, they don't have a synchronization point across request recycle instances and it isn't clear what the barriers add. blk_mq_check_expired() can easily read STARTED from the N-2'th iteration, the deadline from the N-1'th, and run blk_mark_rq_complete() against the Nth instance.
  In fact, it's pretty easy to make blk_mq_check_expired() terminate a later instance of a request. If we induce a 5 sec delay before the time_after_eq() test in blk_mq_check_expired(), shorten the timeout to 2s, and issue back-to-back large IOs, blk-mq starts timing out requests spuriously pretty quickly. Nothing actually timed out. It just made the call on a recycle instance of a request and then terminated a later instance long after the original instance finished. The scenario isn't theoretical either.
  This patch replaces the broken synchronization mechanism with an RCU and generation number based one.
  1. Each request has a u64 generation + state value, which can be updated only by the request owner. Whenever a request becomes in-flight, the generation number gets bumped up too. This provides the basis for the timeout path to distinguish different recycle instances of the request. Also, marking a request in-flight and setting its deadline are protected with a seqcount so that the timeout path can fetch both values coherently.
  2. The timeout path fetches the generation, state and deadline. If the verdict is timeout, it records the generation into a dedicated request abortion field and does an RCU wait.
  3. The completion path is also protected by RCU (from the previous patch) and checks whether the current generation number and state match the abortion field. If so, it skips completion.
  4. The timeout path, after the RCU wait, scans requests again and terminates the ones whose generation and state still match the ones requested for abortion. By now, the timeout path knows that either the generation number and state changed if it lost the race, or the completion will yield to it, and it can safely time out the request.
  While it's more lines of code, it's conceptually simpler, doesn't depend on direct use of subtle memory ordering or coherence, and hopefully doesn't terminate the wrong instance.
  While this change makes REQ_ATOM_COMPLETE synchronization unnecessary between the issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't removed yet as it's still used in other places. Future patches will move all state tracking to the new mechanism and remove all bitops in the hot paths.
  Note that this patch adds a comment explaining a race condition in the BLK_EH_RESET_TIMER path. The race has always been there and this patch doesn't change it; it's just documenting the existing race.
  v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
      - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
      - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
  v3: - Fixed possible extended seqcount / u64_stats_sync read looping spotted by Peter.
      - MQ_RQ_IDLE was incorrectly being set in complete_request instead of free_request. Fixed.
  v4: - Rebased on top of hctx_lock() refactoring patch.
      - Added comment explaining the use of hctx_lock() in completion path.
  v5: - Added comments requested by Bart.
      - Note the addition of BLK_EH_RESET_TIMER race condition in the commit message.
  Signed-off-by: Tejun Heo <tj@kernel.org>
  Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
  Cc: Peter Zijlstra <peterz@infradead.org>
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
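  A sketch of the generation + state encoding this describes (constants and helper modeled on the 4.16-era block/blk-mq.h from memory; treat as approximate):

      #define MQ_RQ_STATE_BITS        2
      #define MQ_RQ_STATE_MASK        ((1UL << MQ_RQ_STATE_BITS) - 1)
      #define MQ_RQ_GEN_INC           (1UL << MQ_RQ_STATE_BITS)

      enum mq_rq_state {
              MQ_RQ_IDLE      = 0,
              MQ_RQ_IN_FLIGHT = 1,
      };

      static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
      {
              /* low bits hold the state; the rest is the generation counter,
                 bumped each time the request goes in-flight, so a timeout
                 verdict can be matched to the exact recycle instance */
              return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
      }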
| * blk-mq: protect completion path with RCU (Tejun Heo, 2018-01-09, 1 file, -0/+5)
  Currently, blk-mq protects only the issue path with RCU. This patch puts the completion path under the same RCU protection. This will be used to synchronize issue/completion against timeout by later patches, which will also add the comments.
  Signed-off-by: Tejun Heo <tj@kernel.org>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: move hctx lock/unlock into a helper (Jens Axboe, 2018-01-09, 1 file, -34/+32)
  Move the RCU vs SRCU logic into lock/unlock helpers, which makes the actual functional bits within the locked region much easier to read.
  tj: Reordered in front of timeout revamp patches and added the missing blk_mq_run_hw_queue() conversion.
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Tejun Heo <tj@kernel.org>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: fix kernel oops in blk_mq_tag_idle() (Ming Lei, 2018-01-09, 1 file, -1/+2)
  HW queues may be unmapped in some cases, such as in blk_mq_update_nr_hw_queues(), so we need to check for that before calling blk_mq_tag_idle(); otherwise the following kernel oops can be triggered. Fix it by checking whether the hw queue is unmapped, since it doesn't make sense to idle the tags any more after hw queues are unmapped.
  [ 440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
  [ 440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000
  [ 440.788359] RIP: 0010:[<ffffffffb730e2b4>] [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
  [ 440.798697] RSP: 0018:ffff893bf9bcbd10 EFLAGS: 00010286
  [ 440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f
  [ 440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00
  [ 440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000
  [ 440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008
  [ 440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080
  [ 440.849988] FS: 0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000
  [ 440.859955] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0
  [ 440.876169] Call Trace:
  [ 440.879818] [<ffffffffb7309d68>] blk_mq_exit_hctx+0xd8/0xe0
  [ 440.887051] [<ffffffffb730dc40>] blk_mq_free_queue+0xf0/0x160
  [ 440.894465] [<ffffffffb72ff679>] blk_cleanup_queue+0xd9/0x150
  [ 440.901881] [<ffffffffc08a802b>] nvme_ns_remove+0x5b/0xb0 [nvme_core]
  [ 440.910068] [<ffffffffc08a811b>] nvme_remove_namespaces+0x3b/0x60 [nvme_core]
  [ 440.919026] [<ffffffffc08b817b>] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma]
  [ 440.928079] [<ffffffffc08b8237>] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma]
  [ 440.937126] [<ffffffffb70ab58a>] process_one_work+0x17a/0x440
  [ 440.944517] [<ffffffffb70ac3a8>] worker_thread+0x278/0x3c0
  [ 440.951607] [<ffffffffb70ac130>] ? manage_workers.isra.24+0x2a0/0x2a0
  [ 440.959760] [<ffffffffb70b352f>] kthread+0xcf/0xe0
  [ 440.966055] [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
  [ 440.973715] [<ffffffffb76d8658>] ret_from_fork+0x58/0x90
  [ 440.980586] [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
  [ 440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5 <f0> ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f
  [ 441.011620] RIP [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
  [ 441.019301] RSP <ffff893bf9bcbd10>
  [ 441.024052] CR2: 0000000000000008
  Reported-by: Zhang Yi <yizhan@redhat.com>
  Tested-by: Zhang Yi <yizhan@redhat.com>
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
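  The essence of the fix as described, in blk_mq_exit_hctx() (a two-line sketch; blk_mq_hw_queue_mapped() is the existing mapped-check helper):

      if (blk_mq_hw_queue_mapped(hctx))
              blk_mq_tag_idle(hctx);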
| * blk-mq: fix race between updating nr_hw_queues and switching io sched (Ming Lei, 2018-01-06, 1 file, -0/+4)
  In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags can be allocated and q->nr_hw_queues is used, so a race is inevitable. For example, blk_mq_init_sched() may trigger a use-after-free on a hctx, which is freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased. This patch fixes the race by holding q->sysfs_lock.
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Reported-by: Yi Zhang <yi.zhang@redhat.com>
  Tested-by: Yi Zhang <yi.zhang@redhat.com>
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: avoid to map CPU into stale hw queue (Ming Lei, 2018-01-06, 1 file, -2/+20)
  blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its previous map isn't cleared yet and may point to a stale hw queue index. This patch fixes the following issue, reported by Zhang Yi, by clearing the mapping table before setting it up in blk_mq_pci_map_queues():
  [ 101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f
  [ 101.211487] IP: blk_mq_map_swqueue+0xbc/0x200
  [ 101.216346] PGD 0 P4D 0
  [ 101.219171] Oops: 0000 [#1] SMP
  [ 101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
  [ 101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1
  [ 101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
  [ 101.301001] Workqueue: nvme-wq nvme_reset_work [nvme]
  [ 101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9
  [ 101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200
  [ 101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282
  [ 101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000
  [ 101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008
  [ 101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280
  [ 101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008
  [ 101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558
  [ 101.364314] FS: 0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
  [ 101.373342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0
  [ 101.387714] Call Trace:
  [ 101.390445] blk_mq_update_nr_hw_queues+0xbf/0x130
  [ 101.395791] nvme_reset_work+0x6f4/0xc06 [nvme]
  [ 101.400848] ? pick_next_task_fair+0x290/0x5f0
  [ 101.405807] ? __switch_to+0x1f5/0x430
  [ 101.409988] ? put_prev_entity+0x2f/0xd0
  [ 101.414365] process_one_work+0x141/0x340
  [ 101.418836] worker_thread+0x47/0x3e0
  [ 101.422921] kthread+0xf5/0x130
  [ 101.426424] ? rescuer_thread+0x380/0x380
  [ 101.430896] ? kthread_associate_blkcg+0x90/0x90
  [ 101.436048] ret_from_fork+0x1f/0x30
  [ 101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 <48> 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44
  [ 101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70
  [ 101.468205] CR2: 0000000000000098
  [ 101.471907] ---[ end trace 5fe710f98228a3ca ]---
  [ 101.482489] Kernel panic - not syncing: Fatal exception
  [ 101.488505] Kernel Offset: disabled
  [ 101.497752] ---[ end Kernel panic - not syncing: Fatal exception
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Suggested-by: Christoph Hellwig <hch@lst.de>
  Reported-by: Yi Zhang <yi.zhang@redhat.com>
  Tested-by: Yi Zhang <yi.zhang@redhat.com>
  Signed-off-by: Ming Lei <ming.lei@redhat.com>
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
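  A sketch of the clearing step this describes (field names follow the 4.15-era blk_mq_tag_set; illustrative, not verbatim):

      /* wipe the table before rebuilding it, so CPUs that end up unmapped
         can't retain a stale hw queue index from a previous mapping */
      for_each_possible_cpu(cpu)
              set->mq_map[cpu] = 0;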