summaryrefslogtreecommitdiffstats
path: root/drivers/nvme/host
Commit message (Collapse)AuthorAgeFilesLines
* nvme-core: declare local symbols staticBart Van Assche2018-10-171-1/+1
| | | | | | | | | This patch avoids that sparse complains about missing declarations. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvmet-rdma: check for timeout in nvme_rdma_wait_for_cm()Bart Van Assche2018-10-171-1/+8
| | | | | | | | Check whether queue->cm_error holds a value before reading it. This patch addresses Coverity ID 1373774: unchecked return value. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme: update node paths after adding new pathKeith Busch2018-10-171-0/+9
| | | | | | | | | | | | | | The nvme namespace paths were being updated only when the current path was not set or nonoptimized. If a new path comes online that is a better path for its NUMA node, the multipath selector may continue using the previously set path on a potentially further node. This patch re-runs the path assignment after successfully adding a new optimized path. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* lightnvm: do no update csecs and sos on 1.2Javier González2018-10-091-0/+3
| | | | | | | | | | 1.2 devices exposes their data and metadata size through the separate identify command. Make sure that the NVMe LBA format does not override these values. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: use internal allocation for chunk log pageJavier González2018-10-091-8/+17
| | | | | | | | | | | | | | | | | | The lightnvm subsystem provides helpers to retrieve chunk metadata, where the target needs to provide a buffer to store the metadata. An implicit assumption is that this buffer is contiguous and can be used to retrieve the data from the device. If the device exposes too many chunks, then kmalloc might fail, thus failing instance creation. This patch removes this assumption by implementing an internal buffer in the lightnvm subsystem to retrieve chunk metadata. Targets can then use virtual memory allocations. Since this is a target API change, adapt pblk accordingly. Signed-off-by: Javier González <javier@cnexlabs.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: move bad block and chunk state logic to coreMatias Bjørling2018-10-091-2/+2
| | | | | | | | | | | | | | | | | | | | | pblk implements two data paths for recovery line state. One for 1.2 and another for 2.0, instead of having pblk implement these, combine them in the core to reduce complexity and make available to other targets. The new interface will adhere to the 2.0 chunk definition, including managing open chunks with an active write pointer. To provide this interface, a 1.2 device recovers the state of the chunks by manually detecting if a chunk is either free/open/close/offline, and if open, scanning the flash pages sequentially to find the next writeable page. This process takes on average ~10 seconds on a device with 64 dies, 1024 blocks and 60us read access time. The process can be parallelized but is left out for maintenance simplicity, as the 1.2 specification is deprecated. For 2.0 devices, the logic is maintained internally in the drive and retrieved through the 2.0 interface. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme: take node locality into account when selecting a pathChristoph Hellwig2018-10-013-28/+54
| | | | | | | | | | | Make current_path an array with an entry for every possible node, and cache the best path on a per-node basis. Take the node distance into account when selecting it. This is primarily useful for dual-ported PCIe devices which are connected to PCIe root ports on different sockets. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Hannes Reinecke <hare@suse.com>
* nvme: call nvme_complete_rq when nvmf_check_ready fails for mpath I/OJames Smart2018-10-012-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an io is rejected by nvmf_check_ready() due to validation of the controller state, the nvmf_fail_nonready_command() will normally return BLK_STS_RESOURCE to requeue and retry. However, if the controller is dying or the I/O is marked for NVMe multipath, the I/O is failed so that the controller can terminate or so that the io can be issued on a different path. Unfortunately, as this reject point is before the transport has accepted the command, blk-mq ends up completing the I/O and never calls nvme_complete_rq(), which is where multipath may preserve or re-route the I/O. The end result is, the device user ends up seeing an EIO error. Example: single path connectivity, controller is under load, and a reset is induced. An I/O is received: a) while the reset state has been set but the queues have yet to be stopped; or b) after queues are started (at end of reset) but before the reconnect has completed. The I/O finishes with an EIO status. This patch makes the following changes: - Adds the HOST_PATH_ERROR pathing status from TP4028 - Modifies the reject point such that it appears to queue successfully, but actually completes the io with the new pathing status and calls nvme_complete_rq(). - nvme_complete_rq() recognizes the new status, avoids resetting the controller (likely was already done in order to get this new status), and calls the multipather to clear the current path that errored. This allows the next command (retry or new command) to select a new path if there is one. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-core: add async event trace helperChaitanya Kulkarni2018-10-012-2/+37
| | | | | | | | | This patch adds a new event for nvme async event notification. We print the async event in the decoded format when we recognize the event otherwise we just dump the result. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme_fc: add 'nvme_discovery' sysfs attribute to fc transport deviceJames Smart2018-10-011-9/+95
| | | | | | | | | | | | | The fc transport device should allow for a rediscovery, as userspace might have lost the events. Example is udev events not handled during system startup. This patch add a sysfs entry 'nvme_discovery' on the fc class to have it replay all udev discovery events for all local port/remote port address pairs. Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-fc: fix for a minor typosMilan P. Gandhi2018-10-011-2/+2
| | | | | | Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme: fix typo in nvme_identify_ns_descsMilan P. Gandhi2018-10-011-1/+1
| | | | | | Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* Merge tag 'v4.19-rc6' into for-4.20/blockJens Axboe2018-10-011-2/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge -rc6 in, for two reasons: 1) Resolve a trivial conflict in the blk-mq-tag.c documentation 2) A few important regression fixes went into upstream directly, so they aren't in the 4.20 branch. Signed-off-by: Jens Axboe <axboe@kernel.dk> * tag 'v4.19-rc6': (780 commits) Linux 4.19-rc6 MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c cpufreq: qcom-kryo: Fix section annotations perf/core: Add sanity check to deal with pinned event failure xen/blkfront: correct purging of persistent grants Revert "xen/blkfront: When purging persistent grants, keep them in the buffer" selftests/powerpc: Fix Makefiles for headers_install change blk-mq: I/O and timer unplugs are inverted in blktrace dax: Fix deadlock in dax_lock_mapping_entry() x86/boot: Fix kexec booting failure in the SEV bit detection code bcache: add separate workqueue for journal_write to avoid deadlock drm/amd/display: Fix Edid emulation for linux drm/amd/display: Fix Vega10 lightup on S3 resume drm/amdgpu: Fix vce work queue was not cancelled when suspend Revert "drm/panel: Add device_link from panel device to DRM device" xen/blkfront: When purging persistent grants, keep them in the buffer clocksource/drivers/timer-atmel-pit: Properly handle error cases block: fix deadline elevator drain for zoned block devices ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set ... Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nvme: properly propagate errors in nvme_mpath_initSusobhan Dey2018-09-261-2/+4
| | | | | | | | | | Signed-off-by: Susobhan Dey <susobhan.dey@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | nvme: register ns_id attributes as default sysfs groupsHannes Reinecke2018-09-284-92/+59Star
| | | | | | | | | | | | | | | | | | | | | | We should be registering the ns_id attribute as default sysfs attribute groups, otherwise we have a race condition between the uevent and the attributes appearing in sysfs. Suggested-by: Bart van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: genhd: add 'groups' argument to device_add_diskHannes Reinecke2018-09-282-2/+2
|/ | | | | | | | | | | | | | Update device_add_disk() to take an 'groups' argument so that individual drivers can register a device with additional sysfs attributes. This avoids race condition the driver would otherwise have if these groups were to be created with sysfs_add_groups(). Signed-off-by: Martin Wilck <martin.wilck@suse.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_eventMichal Wnukowski2018-08-281-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In many architectures loads may be reordered with older stores to different locations. In the nvme driver the following two operations could be reordered: - Write shadow doorbell (dbbuf_db) into memory. - Read EventIdx (dbbuf_ei) from memory. This can result in a potential race condition between driver and VM host processing requests (if given virtual NVMe controller has a support for shadow doorbell). If that occurs, then the NVMe controller may decide to wait for MMIO doorbell from guest operating system, and guest driver may decide not to issue MMIO doorbell on any of subsequent commands. This issue is purely timing-dependent one, so there is no easy way to reproduce it. Currently the easiest known approach is to run "Oracle IO Numbers" (orion) that is shipped with Oracle DB: orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \ concat -write 40 -duration 120 -matrix row -testname nvme_test Where nvme_test is a .lun file that contains a list of NVMe block devices to run test against. Limiting number of vCPUs assigned to given VM instance seems to increase chances for this bug to occur. On test environment with VM that got 4 NVMe drives and 1 vCPU assigned the virtual NVMe controller hang could be observed within 10-20 minutes. That correspond to about 400-500k IO operations processed (or about 100GB of IO read/writes). Orion tool was used as a validation and set to run in a loop for 36 hours (equivalent of pushing 550M IO operations). No issues were observed. That suggest that the patch fixes the issue. Fixes: f9f38e33389c ("nvme: improve performance for virtual NVMe devices") Signed-off-by: Michal Wnukowski <wnukowski@google.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> [hch: updated changelog and comment a bit] Signed-off-by: Christoph Hellwig <hch@lst.de>
* Merge branch 'linus/master' into rdma.git for-nextJason Gunthorpe2018-08-1610-359/+670
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rdma.git merge resolution for the 4.19 merge window Conflicts: drivers/infiniband/core/rdma_core.c - Use the rdma code and revise with the new spelling for atomic_fetch_add_unless drivers/nvme/host/rdma.c - Replace max_sge with max_send_sge in new blk code drivers/nvme/target/rdma.c - Use the blk code and revise to use NULL for ib_post_recv when appropriate - Replace max_sge with max_recv_sge in new blk code net/rds/ib_send.c - Use the net code and revise to use NULL for ib_post_recv when appropriate Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
| * Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-blockLinus Torvalds2018-08-1410-359/+670
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: "First pull request for this merge window, there will also be a followup request with some stragglers. This pull request contains: - Fix for a thundering heard issue in the wbt block code (Anchal Agarwal) - A few NVMe pull requests: * Improved tracepoints (Keith) * Larger inline data support for RDMA (Steve Wise) * RDMA setup/teardown fixes (Sagi) * Effects log suppor for NVMe target (Chaitanya Kulkarni) * Buffered IO suppor for NVMe target (Chaitanya Kulkarni) * TP4004 (ANA) support (Christoph) * Various NVMe fixes - Block io-latency controller support. Much needed support for properly containing block devices. (Josef) - Series improving how we handle sense information on the stack (Kees) - Lightnvm fixes and updates/improvements (Mathias/Javier et al) - Zoned device support for null_blk (Matias) - AIX partition fixes (Mauricio Faria de Oliveira) - DIF checksum code made generic (Max Gurtovoy) - Add support for discard in iostats (Michael Callahan / Tejun) - Set of updates for BFQ (Paolo) - Removal of async write support for bsg (Christoph) - Bio page dirtying and clone fixups (Christoph) - Set of bcache fix/changes (via Coly) - Series improving blk-mq queue setup/teardown speed (Ming) - Series improving merging performance on blk-mq (Ming) - Lots of other fixes and cleanups from a slew of folks" * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits) blkcg: Make blkg_root_lookup() work for queues in bypass mode bcache: fix error setting writeback_rate through sysfs interface null_blk: add lock drop/acquire annotation Blk-throttle: reduce tail io latency when iops limit is enforced block: paride: pd: mark expected switch fall-throughs block: Ensure that a request queue is dissociated from the cgroup controller block: Introduce blk_exit_queue() blkcg: Introduce blkg_root_lookup() block: Remove two superfluous #include directives blk-mq: count the hctx as active before allocating tag block: bvec_nr_vecs() returns value for wrong slab bcache: trivial - remove tailing backslash in macro BTREE_FLAG bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section bcache: set max writeback rate when I/O request is idle bcache: add code comments for bset.c bcache: fix mistaken comments in request.c bcache: fix mistaken code comments in bcache.h bcache: add a comment in super.c bcache: avoid unncessary cache prefetch bch_btree_node_get() bcache: display rate debug parameters to 0 when writeback is not running ...
| | * nvme-fabrics: fix ctrl_loss_tmo < 0 to reconnect foreverTal Shorer2018-08-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the user supplies a ctrl_loss_tmo < 0, we warn them that this will cause the fabrics layer to attempt reconnection forever. However, in reality the fabrics layer never attempts to reconnect because the condition to test whether we should reconnect is backwards in this case. Signed-off-by: Tal Shorer <tal.shorer@gmail.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * nvme: set gendisk read only based on nsattrChaitanya Kulkarni2018-08-081-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NVMe 1.3 TP 4005 introduces new filed (NSATTR). This field indicates whether given namespace is write protected or not. This patch sets the gendisk associated with the namespace to read only based on the identify namespace nsattr field. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * nvme: fixup crash on failed discoveryHannes Reinecke2018-08-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | When the initial discovery fails the subsystem hasn't been setup yet in nvme_mpath_stop, and we can't dereference ctrl->subsys. Fixes: 0d0b660f ("nvme: add ANA support") Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * lightnvm: remove minor version check for 2.0Matias Bjørling2018-08-061-6/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A minor version number increase should not break backwards compatibility. Fixes: 3cb98f84d368b ("lightnvm: add minor version to generic geometry") Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block2Jens Axboe2018-08-064-62/+422
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe changes from Christoph: "This contains the support for TP4004, Asymmetric Namespace Access, which makes NVMe multipathing usable in practice." * 'nvme-4.19' of git://git.infradead.org/nvme: nvmet: use Retain Async Event bit to clear AEN nvmet: support configuring ANA groups nvmet: add minimal ANA support nvmet: track and limit the number of namespaces per subsystem nvmet: keep a port pointer in nvmet_ctrl nvme: add ANA support nvme: remove nvme_req_needs_failover nvme: simplify the API for getting log pages nvme.h: add ANA definitions nvme.h: add support for the log specific field Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | | * nvme: add ANA supportChristoph Hellwig2018-07-273-27/+408
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for Asynchronous Namespace Access as specified in NVMe 1.3 TP 4004. With ANA each namespace attached to a controller belongs to an ANA group that describes the characteristics of accessing the namespaces through this controller. In the optimized and non-optimized states namespaces can be accessed regularly, although in a multi-pathing environment we should always prefer to access a namespace through a controller where an optimized relationship exists. Namespaces in Inaccessible, Permanent-Loss or Change state for a given controller should not be accessed. The states are updated through reading the ANA log page, which is read once during controller initialization, whenever the ANA change notice AEN is received, or when one of the ANA specific status codes that signal a state change is received on a command. The ANA state is kept in the nvme_ns structure, which makes the checks in the fast path very simple. Updating the ANA state when reading the log page is also very simple, the only downside is that finding the initial ANA state when scanning for namespaces is a bit cumbersome. The gendisk for a ns_head is only registered once a live path for it exists. Without that the kernel would hang during partition scanning. Includes fixes and improvements from Hannes Reinecke. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
| | | * nvme: remove nvme_req_needs_failoverChristoph Hellwig2018-07-273-14/+2Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we just call out to blk_path_error there isn't really any good reason to not merge it into the only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
| | | * nvme: simplify the API for getting log pagesChristoph Hellwig2018-07-273-25/+16Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge nvme_get_log and nvme_get_log_ext into a single helper, which takes a plain nsid instead of the nvme_ns pointer. Also add support for the log specific field while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
| | * | Merge tag 'v4.18-rc6' into for-4.19/block2Jens Axboe2018-08-062-34/+41
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a merge conflict down the line. Signed-of-by: Jens Axboe <axboe@kernel.dk>
| | * | | nvme: use blk API to remap ref tags for IOs with metadataMax Gurtovoy2018-07-303-82/+20Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Also moved the logic of the remapping to the nvme core driver instead of implementing it in the nvme pci driver. This way all the other nvme transport drivers will benefit from it (in case they'll implement metadata support). Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * | | block: move ref_tag calculation func to the block layerMax Gurtovoy2018-07-301-2/+1Star
| | | |/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently this function is implemented in the scsi layer, but it's actual place should be the block layer since T10-PI is a general data integrity feature that is used in the nvme protocol as well. Suggested-by: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * | nvme-rdma: centralize admin/io queue teardown sequenceSagi Grimberg2018-07-241-37/+29Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We follow the same queue teardown sequence in delete, reset and error recovery. Centralize the logic. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-rdma: centralize controller setup sequenceSagi Grimberg2018-07-241-77/+53Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Centralize controller sequence to a single routine that correctly cleans up after failures instead of having multiple apperances in several flows (create, reset, reconnect). One thing that we also gain here are the sanity/boundary checks also when connecting back to a dynamic controller. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-rdma: unquiesce queues when deleting the controllerSagi Grimberg2018-07-241-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the controller is going away, we need to unquiesce the IO queues so that all pending request can fail gracefully before moving forward with controller deletion. Do that before we destroy the IO queues so blk_cleanup_queue won't block in freeze. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-rdma: mark expected switch fall-throughGustavo A. R. Silva2018-07-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme: add disk name to trace eventsKeith Busch2018-07-242-7/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This will print the disk name to the nvme event trace for io requests so a user can better distinguish traffic to different disks. This can be used to create disk based filters. For example, to see only nvme0n2 traffic: echo "disk == \"nvme0n2\"" > /sys/kernel/debug/tracing/events/nvme/filter Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> [hch: turned __assign_disk_name into an inline function] Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme: add controller name to trace eventsKeith Busch2018-07-241-6/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This appends the controller instance to the nvme trace buffer to distinguish which controller is dispatching and completing a command. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme: use hw qid in trace eventsKeith Busch2018-07-233-71/+53Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can not match a command to its completion based on the command id alone. We need the submitting queue identifier to pair with the completion, so this patch adds that to the trace buffer. This patch is also collapsing the admin and IO submission traces into a single one so we don't need to duplicate this and creating unnecessary code branches: we know if the command is an admin vs IO based on the qid. And since we're here, the patch fixes code formatting in the area. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> [hch: move the qid helper to nvme.h and made it an inline function] Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme: cache struct nvme_ctrl reference to struct nvme_requestSagi Grimberg2018-07-234-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We will need to reference the controller in the setup and completion time for tracing and future traffic based keep alive support. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme-rdma: support up to 4 segments of inline dataSteve Wise2018-07-231-11/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow up to 4 segments of inline data for NVMF WRITE operations. This reduces latency for small WRITEs by removing the need for the target to issue a READ WR for IB, or a REG_MR + READ WR chain for iWarp. Also cap the inline segments used based on the limitations of the device. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme: move init of keep_alive work item to controller initializationJames Smart2018-07-231-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the code initializes the keep alive work item whenever nvme_start_keep_alive() is called. However, this routine is called several times while reconnecting, etc. Although it's hoped that keep alive is always disabled and not scheduled when start is called, re-initing if it were scheduled or completing can have very bad side effects. There's no need for re-initialization. Move the keep_alive work item and cmd struct initialization to controller init. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | lightnvm: limit get chunk meta request sizeMatias Bjørling2018-07-131-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For devices that does not specify a limit on its transfer size, the get_chk_meta command may send down a single I/O retrieving the full chunk metadata table. Resulting in large 2-4MB I/O requests. Instead, split up the I/Os to a maximum of 256KB and issue them separately to reduce memory requirements. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * | lightnvm: Remove redundant rq->__data_len initializationBart Van Assche2018-07-131-4/+2Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since both blk_old_get_request() and blk_mq_alloc_request() initialize rq->__data_len to zero, it is not necessary to initialize that member in nvme_nvm_alloc_request(). Hence remove the rq->__data_len initialization from nvme_nvm_alloc_request(). Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | Merge tag 'v4.18' into rdma.git for-nextJason Gunthorpe2018-08-167-81/+131
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Resolve merge conflicts from the -rc cycle against the rdma.git tree: Conflicts: drivers/infiniband/core/uverbs_cmd.c - New ifs added to ib_uverbs_ex_create_flow in -rc and for-next - Merge removal of file->ucontext in for-next with new code in -rc drivers/infiniband/core/uverbs_main.c - for-next removed code from ib_uverbs_write() that was modified in for-rc Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
| * | | Merge tag 'for-linus-20180727' of git://git.kernel.dk/linux-blockLinus Torvalds2018-07-274-6/+11
| |\ \ \ | | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "Bigger than usual at this time, mostly due to the O_DIRECT corruption issue and the fact that I was on vacation last week. This contains: - NVMe pull request with two fixes for the FC code, and two target fixes (Christoph) - a DIF bio reset iteration fix (Greg Edwards) - two nbd reply and requeue fixes (Josef) - SCSI timeout fixup (Keith) - a small series that fixes an issue with bio_iov_iter_get_pages(), which ended up causing corruption for larger sized O_DIRECT writes that ended up racing with buffered writes (Martin Wilck)" * tag 'for-linus-20180727' of git://git.kernel.dk/linux-block: block: reset bi_iter.bi_done after splitting bio block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs blkdev: __blkdev_direct_IO_simple: fix leak in error case block: bio_iov_iter_get_pages: fix size of last iovec nvmet: only check for filebacking on -ENOTBLK nvmet: fixup crash on NULL device path scsi: set timed out out mq requests to complete blk-mq: export setting request completion state nvme: if_ready checks to fail io to deleting controller nvmet-fc: fix target sgl list on large transfers nbd: handle unexpected replies better nbd: don't requeue the same request twice.
| | * | nvme: if_ready checks to fail io to deleting controllerJames Smart2018-07-244-6/+11
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The revised if_ready checks skipped over the case of returning error when the controller is being deleted. Instead it was returning BUSY, which caused the ios to retry, which caused the ns delete to hang waiting for the ios to drain. Stack trace of hang looks like: kworker/u64:2 D 0 74 2 0x80000000 Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core] Call Trace: ? __schedule+0x26d/0x820 schedule+0x32/0x80 blk_mq_freeze_queue_wait+0x36/0x80 ? remove_wait_queue+0x60/0x60 blk_cleanup_queue+0x72/0x160 nvme_ns_remove+0x106/0x140 [nvme_core] nvme_remove_namespaces+0x7e/0xa0 [nvme_core] nvme_delete_ctrl_work+0x4d/0x80 [nvme_core] process_one_work+0x160/0x350 worker_thread+0x1c3/0x3d0 kthread+0xf5/0x130 ? process_one_work+0x350/0x350 ? kthread_bind+0x10/0x10 ret_from_fork+0x1f/0x30 Extend nvmf_fail_nonready_command() to supply the controller pointer so that the controller state can be looked at. Fail any io to a controller that is deleting. Fixes: 3bc32bb1186c ("nvme-fabrics: refactor queue ready check") Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready") Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Ewan D. Milne <emilne@redhat.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com>
| * | nvme: fix handling of metadata_len for NVME_IOCTL_IO_CMDRoland Dreier2018-07-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The old code in nvme_user_cmd() passed the userspace virtual address from nvme_passthru_cmd.metadata as the length of the metadata buffer as well as the address to nvme_submit_user_cmd(). Fixes: 63263d60 ("nvme: Use metadata for passthrough commands") Signed-off-by: Roland Dreier <roland@purestorage.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | nvme: don't enable AEN if not supportedWeiping Zhang2018-07-171-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid excuting set_feature command if there is no supported bit in Optional Asynchronous Events Supported (OAES). Fixes: c0561f82 ("nvme: submit AEN event configuration on startup") Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Weiping Zhang <zhangweiping@didichuxing.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | nvme: ensure forward progress during Admin passthruScott Bauer2018-07-171-24/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the controller supports effects and goes down during the passthru admin command we will deadlock during namespace revalidation. [ 363.488275] INFO: task kworker/u16:5:231 blocked for more than 120 seconds. [ 363.488290] Not tainted 4.17.0+ #2 [ 363.488296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 363.488303] kworker/u16:5 D 0 231 2 0x80000000 [ 363.488331] Workqueue: nvme-reset-wq nvme_reset_work [nvme] [ 363.488338] Call Trace: [ 363.488385] schedule+0x75/0x190 [ 363.488396] rwsem_down_read_failed+0x1c3/0x2f0 [ 363.488481] call_rwsem_down_read_failed+0x14/0x30 [ 363.488504] down_read+0x1d/0x80 [ 363.488523] nvme_stop_queues+0x1e/0xa0 [nvme_core] [ 363.488536] nvme_dev_disable+0xae4/0x1620 [nvme] [ 363.488614] nvme_reset_work+0xd1e/0x49d9 [nvme] [ 363.488911] process_one_work+0x81a/0x1400 [ 363.488934] worker_thread+0x87/0xe80 [ 363.488955] kthread+0x2db/0x390 [ 363.488977] ret_from_fork+0x35/0x40 Fixes: 84fef62d135b6 ("nvme: check admin passthru command effects") Signed-off-by: Scott Bauer <scott.bauer@intel.com> Reviewed-by: Keith Busch <keith.busch@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | nvme-pci: fix memory leak on probe failureKeith Busch2018-07-121-5/+7
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The nvme driver specific structures need to be initialized prior to enabling the generic controller so we can unwind on failure with out using the reference counting callbacks so that 'probe' and 'remove' can be symmetric. The newly added iod_mempool is the only resource that was being allocated out of order, and a failure there would leak the generic controller memory. This patch just moves that allocation above the controller initialization. Fixes: 943e942e6266f ("nvme-pci: limit max IO size and segments to avoid high order allocations") Reported-by: Weiping Zhang <zwp10758@gmail.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-rdma: fix possible double free of controller async event bufferSagi Grimberg2018-06-281-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If reconnect/reset failed where the controller async event buffer was freed, we might end up freeing it again as we call nvme_rdma_destroy_admin_queue again in the remove path. Given that the sequence is guaranteed to serialize by .ctrl_stop, we simply set ctrl->async_event_sqe.data to NULL and don't free it in future visits. Reported-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>