summaryrefslogtreecommitdiffstats
path: root/net/sunrpc/xprtrdma
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'akpm' (patches from Andrew)Linus Torvalds2019-07-191-2/+1Star
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge yet more updates from Andrew Morton: "The rest of MM and a kernel-wide procfs cleanup. Summary of the more significant patches: - Patch series "mm/memory_hotplug: Factor out memory block devicehandling", v3. David Hildenbrand. Some spring-cleaning of the memory hotplug code, notably in drivers/base/memory.c - "mm: thp: fix false negative of shmem vma's THP eligibility". Yang Shi. Fix /proc/pid/smaps output for THP pages used in shmem. - "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit. Bugfix and speedup for kernel/resource.c - Patch series "mm: Further memory block device cleanups", David Hildenbrand. More spring-cleaning of the memory hotplug code. - Patch series "mm: Sub-section memory hotplug support". Dan Williams. Generalise the memory hotplug code so that pmem can use it more completely. Then remove the hacks from the libnvdimm code which were there to work around the memory-hotplug code's constraints. - "proc/sysctl: add shared variables for range check", Matteo Croce. We have about 250 instances of int zero; ... .extra1 = &zero, in the tree. This is a tree-wide sweep to make all those private "zero"s and "one"s use global variables. Alas, it isn't practical to make those two global integers const" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (38 commits) proc/sysctl: add shared variables for range check mm: migrate: remove unused mode argument mm/sparsemem: cleanup 'section number' data types libnvdimm/pfn: stop padding pmem namespaces to section alignment libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields mm/devm_memremap_pages: enable sub-section remap mm: document ZONE_DEVICE memory-model implications mm/sparsemem: support sub-section hotplug mm/sparsemem: prepare for sub-section ranges mm: kill is_dev_zone() helper mm/hotplug: kill is_dev_zone() usage in __remove_pages() mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap() mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal mm/sparsemem: add helpers track active portions of a section at boot mm/sparsemem: introduce a SECTION_IS_EARLY flag mm/sparsemem: introduce struct mem_section_usage drivers/base/memory.c: get rid of find_memory_block_hinted() mm/memory_hotplug: move and simplify walk_memory_blocks() mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns mm: make register_mem_sect_under_node() static ...
| * proc/sysctl: add shared variables for range checkMatteo Croce2019-07-191-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the sysctl code the proc_dointvec_minmax() function is often used to validate the user supplied value between an allowed range. This function uses the extra1 and extra2 members from struct ctl_table as minimum and maximum allowed value. On sysctl handler declaration, in every source file there are some readonly variables containing just an integer which address is assigned to the extra1 and extra2 members, so the sysctl range is enforced. The special values 0, 1 and INT_MAX are very often used as range boundary, leading duplication of variables like zero=0, one=1, int_max=INT_MAX in different source files: $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l 248 Add a const int array containing the most commonly used values, some macros to refer more easily to the correct array member, and use them instead of creating a local one for every object file. This is the bloat-o-meter output comparing the old and new binary compiled with the default Fedora config: # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164) Data old new delta sysctl_vals - 12 +12 __kstrtab_sysctl_vals - 12 +12 max 14 10 -4 int_max 16 - -16 one 68 - -68 zero 128 28 -100 Total: Before=20583249, After=20583085, chg -0.00% [mcroce@redhat.com: tipc: remove two unused variables] Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c] [arnd@arndb.de: proc/sysctl: make firmware loader table conditional] Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de [akpm@linux-foundation.org: fix fs/eventpoll.c] Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'nfs-for-5.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2019-07-188-319/+423
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - SUNRPC: Ensure bvecs are re-synced when we re-encode the RPC request - Fix an Oops in ff_layout_track_ds_error due to a PTR_ERR() dereference - Revert buggy NFS readdirplus optimisation - NFSv4: Handle the special Linux file open access mode - pnfs: Fix a problem where we gratuitously start doing I/O through the MDS Features: - Allow NFS client to set up multiple TCP connections to the server using a new 'nconnect=X' mount option. Queue length is used to balance load. - Enhance statistics reporting to report on all transports when using multiple connections. - Speed up SUNRPC by removing bh-safe spinlocks - Add a mechanism to allow NFSv4 to request that containers set a unique per-host identifier for when the hostname is not set. - Ensure NFSv4 updates the lease_time after a clientid update Bugfixes and cleanup: - Fix use-after-free in rpcrdma_post_recvs - Fix a memory leak when nfs_match_client() is interrupted - Fix buggy file access checking in NFSv4 open for execute - disable unsupported client side deduplication - Fix spurious client disconnections - Fix occasional RDMA transport deadlock - Various RDMA cleanups - Various tracepoint fixes - Fix the TCP callback channel to guarantee the server can actually send the number of callback requests that was negotiated at mount time" * tag 'nfs-for-5.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits) pnfs/flexfiles: Add tracepoints for detecting pnfs fallback to MDS pnfs: Fix a problem where we gratuitously start doing I/O through the MDS SUNRPC: Optimise transport balancing code SUNRPC: Ensure the bvecs are reset when we re-encode the RPC request pnfs/flexfiles: Fix PTR_ERR() dereferences in ff_layout_track_ds_error NFSv4: Don't use the zero stateid with layoutget SUNRPC: Fix up backchannel slot table accounting SUNRPC: Fix initialisation of struct rpc_xprt_switch SUNRPC: Skip zero-refcount transports SUNRPC: Replace division by multiplication in calculation of queue length NFSv4: Validate the stateid before applying it to state recovery nfs4.0: Refetch lease_time after clientid update nfs4: Rename nfs41_setup_state_renewal nfs4: Make nfs4_proc_get_lease_time available for nfs4.0 nfs: Fix copy-and-paste error in debug message NFS: Replace 16 seq_printf() calls by seq_puts() NFS: Use seq_putc() in nfs_show_stats() Revert "NFS: readdirplus optimization by cache mechanism" (memleak) SUNRPC: Fix transport accounting when caller specifies an rpc_xprt NFS: Record task, client ID, and XID in xdr_status trace points ...
| * SUNRPC: Fix up backchannel slot table accountingTrond Myklebust2019-07-183-0/+9
| | | | | | | | | | | | | | Add a per-transport maximum limit in the socket case, and add helpers to allow the NFSv4 code to discover that limit. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| * Merge tag 'nfs-rdma-for-5.3-1' of ↵Trond Myklebust2019-07-125-311/+406
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.linux-nfs.org/projects/anna/linux-nfs NFSoRDMA client updates for 5.3 New features: - Add a way to place MRs back on the free list - Reduce context switching - Add new trace events Bugfixes and cleanups: - Fix a BUG when tracing is enabled with NFSv4.1 - Fix a use-after-free in rpcrdma_post_recvs - Replace use of xdr_stream_pos in rpcrdma_marshal_req - Fix occasional transport deadlock - Fix show_nfs_errors macros, other tracing improvements - Remove RPCRDMA_REQ_F_PENDING and fr_state - Various simplifications and refactors
| | * xprtrdma: Modernize ops->connectChuck Lever2019-07-092-15/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adapt and apply changes that were made to the TCP socket connect code. See the following commits for details on the purpose of these changes: Commit 7196dbb02ea0 ("SUNRPC: Allow changing of the TCP timeout parameters on the fly") Commit 3851f1cdb2b8 ("SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout") Commit 02910177aede ("SUNRPC: Fix reconnection timeouts") Some common transport code is moved to xprt.c to satisfy the code duplication police. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * xprtrdma: Remove rpcrdma_req::rl_bufferChuck Lever2019-07-093-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up. There is only one remaining function, rpcrdma_buffer_put(), that uses this field. Its caller can supply a pointer to the correct rpcrdma_buffer, enabling the removal of an 8-byte pointer field from a frequently-allocated shared data structure. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * xprtrdma: Refactor chunk encodingChuck Lever2019-07-091-20/+16Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up. Move the "not present" case into the individual chunk encoders. This improves code organization and readability. The reason for the original organization was to optimize for the case where there there are no chunks. The optimization turned out to be inconsequential, so let's err on the side of code readability. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * xprtrdma: Streamline rpcrdma_post_recvsChuck Lever2019-07-091-21/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rb_lock is contended between rpcrdma_buffer_create, rpcrdma_buffer_put, and rpcrdma_post_recvs. Commit e340c2d6ef2a ("xprtrdma: Reduce the doorbell rate (Receive)") causes rpcrdma_post_recvs to take the rb_lock repeatedly when it determines more Receives are needed. Streamline this code path so it takes the lock just once in most cases to build the Receive chain that is about to be posted. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * xprtrdma: Simplify rpcrdma_rep_createChuck Lever2019-07-091-14/+8Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up. Commit 7c8d9e7c8863 ("xprtrdma: Move Receive posting to Receive handler") reduced the number of rpcrdma_rep_create call sites to one. After that commit, the backchannel code no longer invokes it. Therefore the free list logic added by commit d698c4a02ee0 ("xprtrdma: Fix backchannel allocation of extra rpcrdma_reps") is no longer necessary, and in fact adds some extra overhead that we can do without. Simply post any newly created reps. They will get added back to the rb_recv_bufs list when they subsequently complete. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * xprtrdma: Wake RPCs directly in rpcrdma_wc_send pathChuck Lever2019-07-094-50/+36Star
| | | | | | | | | | | | | | | | | | | | | | | | Eliminate a context switch in the path that handles RPC wake-ups when a Receive completion has to wait for a Send completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * xprtrdma: Reduce context switching due to Local InvalidationChuck Lever2019-07-094-53/+136
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit ba69cd122ece ("xprtrdma: Remove support for FMR memory registration"), FRWR is the only supported memory registration mode. We can take advantage of the asynchronous nature of FRWR's LOCAL_INV Work Requests to get rid of the completion wait by having the LOCAL_INV completion handler take care of DMA unmapping MRs and waking the upper layer RPC waiter. This eliminates two context switches when local invalidation is necessary. As a side benefit, we will no longer need the per-xprt deferred completion work queue. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * xprtrdma: Add mechanism to place MRs back on the free listChuck Lever2019-07-093-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a marshal operation fails, any MRs that were already set up for that request are recycled. Recycling releases MRs and creates new ones, which is expensive. Since commit f2877623082b ("xprtrdma: Chain Send to FastReg WRs") was merged, recycling FRWRs is unnecessary. This is because before that commit, frwr_map had already posted FAST_REG Work Requests, so ownership of the MRs had already been passed to the NIC and thus dealing with them had to be delayed until they completed. Since that commit, however, FAST_REG WRs are posted at the same time as the Send WR. This means that if marshaling fails, we are certain the MRs are safe to simply unmap and place back on the free list because neither the Send nor the FAST_REG WRs have been posted yet. The kernel still has ownership of the MRs at this point. This reduces the total number of MRs that the xprt has to create under heavy workloads and makes the marshaling logic less brittle. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * xprtrdma: Remove fr_stateChuck Lever2019-07-093-123/+94Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that both the Send and Receive completions are handled in process context, it is safe to DMA unmap and return MRs to the free or recycle lists directly in the completion handlers. Doing this means rpcrdma_frwr no longer needs to track the state of each MR, meaning that a VALID or FLUSHED MR can no longer appear on an xprt's MR free list. Thus there is no longer a need to track the MR's registration state in rpcrdma_frwr. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * xprtrdma: Remove the RPCRDMA_REQ_F_PENDING flagChuck Lever2019-07-093-5/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 9590d083c1bb ("xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler") pins incoming RPC/RDMA replies so they can be left in the pending requests queue while they are being processed without introducing a race between ->buf_free and the transport's reply handler. Therefore RPCRDMA_REQ_F_PENDING is no longer necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * xprtrdma: Fix occasional transport deadlockChuck Lever2019-07-094-29/+20Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Under high I/O workloads, I've noticed that an RPC/RDMA transport occasionally deadlocks (IOPS goes to zero, and doesn't recover). Diagnosis shows that the sendctx queue is empty, but when sendctxs are returned to the queue, the xprt_write_space wake-up never occurs. The wake-up logic in rpcrdma_sendctx_put_locked is racy. I noticed that both EMPTY_SCQ and XPRT_WRITE_SPACE are implemented via an atomic bit. Just one of those is sufficient. Removing EMPTY_SCQ in favor of the generic bit mechanism makes the deadlock un-reproducible. Without EMPTY_SCQ, rpcrdma_buffer::rb_flags is no longer used and is therefore removed. Unfortunately this patch does not apply cleanly to stable. If needed, someone will have to port it and test it. Fixes: 2fad659209d5 ("xprtrdma: Wait on empty sendctx queue") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * xprtrdma: Replace use of xdr_stream_pos in rpcrdma_marshal_reqChuck Lever2019-07-091-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a latent bug. xdr_stream_pos works by subtracting xdr_stream::nwords from xdr_buf::len. But xdr_stream::nwords is not initialized by xdr_init_encode(). It works today only because all fields in rpcrdma_req::rl_stream are initialized to zero by rpcrdma_req_create, making the subtraction in xdr_stream_pos always a no-op. I found this issue via code inspection. It was introduced by commit 39f4cd9e9982 ("xprtrdma: Harden chunk list encoding against send buffer overflow"), but the code has changed enough since then that this fix can't be automatically applied to stable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * xprtrdma: Fix use-after-free in rpcrdma_post_recvsChuck Lever2019-07-021-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dereference wr->next /before/ the memory backing wr has been released. This issue was found by code inspection. It is not expected to be a significant problem because it is in an error path that is almost never executed. Fixes: 7c8d9e7c8863 ("xprtrdma: Move Receive posting to ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | SUNRPC: Remove the bh-safe lock requirement on xprt->transport_lockTrond Myklebust2019-07-063-8/+8
| | | | | | | | | | | | Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* | | Merge tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds2019-07-121-2/+3
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull SCSI scatter-gather list updates from James Bottomley: "This topic branch covers a fundamental change in how our sg lists are allocated to make mq more efficient by reducing the size of the preallocated sg list. This necessitates a large number of driver changes because the previous guarantee that if a driver specified SG_ALL as the size of its scatter list, it would get a non-chained list and didn't need to bother with scatterlist iterators is now broken and every driver *must* use scatterlist iterators. This was broken out as a separate topic because we need to convert all the drivers before pulling the trigger and unconverted drivers kept being found, necessitating a rebase" * tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (21 commits) scsi: core: don't preallocate small SGL in case of NO_SG_CHAIN scsi: lib/sg_pool.c: clear 'first_chunk' in case of no preallocation scsi: core: avoid preallocating big SGL for data scsi: core: avoid preallocating big SGL for protection information scsi: lib/sg_pool.c: improve APIs for allocating sg pool scsi: esp: use sg helper to iterate over scatterlist scsi: NCR5380: use sg helper to iterate over scatterlist scsi: wd33c93: use sg helper to iterate over scatterlist scsi: ppa: use sg helper to iterate over scatterlist scsi: pcmcia: nsp_cs: use sg helper to iterate over scatterlist scsi: imm: use sg helper to iterate over scatterlist scsi: aha152x: use sg helper to iterate over scatterlist scsi: s390: zfcp_fc: use sg helper to iterate over scatterlist scsi: staging: unisys: visorhba: use sg helper to iterate over scatterlist scsi: usb: image: microtek: use sg helper to iterate over scatterlist scsi: pmcraid: use sg helper to iterate over scatterlist scsi: ipr: use sg helper to iterate over scatterlist scsi: mvumi: use sg helper to iterate over scatterlist scsi: lpfc: use sg helper to iterate over scatterlist scsi: advansys: use sg helper to iterate over scatterlist ...
| * | scsi: lib/sg_pool.c: improve APIs for allocating sg poolMing Lei2019-06-201-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sg_alloc_table_chained() currently allows the caller to provide one preallocated SGL and returns if the requested number isn't bigger than size of that SGL. This is used to inline an SGL for an IO request. However, scattergather code only allows that size of the 1st preallocated SGL to be SG_CHUNK_SIZE(128). This means a substantial amount of memory (4KB) is claimed for the SGL for each IO request. If the I/O is small, it would be prudent to allocate a smaller SGL. Introduce an extra parameter to sg_alloc_table_chained() and sg_free_table_chained() for specifying size of the preallocated SGL. Both __sg_free_table() and __sg_alloc_table() assume that each SGL has the same size except for the last one. Change the code to allow both functions to accept a variable size for the 1st preallocated SGL. [mkp: attempted to clarify commit desc] Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Ewan D. Milne <emilne@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: netdev@vger.kernel.org Cc: linux-nvme@lists.infradead.org Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
* | | Merge tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2019-07-061-1/+6
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | Pull nfsd fixes from Bruce Fields: "Two more quick bugfixes for nfsd: fixing a regression causing mount failures on high-memory machines and fixing the DRC over RDMA" * tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linux: nfsd: Fix overflow causing non-working mounts on 1 TB machines svcrdma: Ignore source port when computing DRC hash
| * | svcrdma: Ignore source port when computing DRC hashChuck Lever2019-06-191-1/+6
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The DRC appears to be effectively empty after an RPC/RDMA transport reconnect. The problem is that each connection uses a different source port, which defeats the DRC hash. Clients always have to disconnect before they send retransmissions to reset the connection's credit accounting, thus every retransmit on NFS/RDMA will miss the DRC. An NFS/RDMA client's IP source port is meaningless for RDMA transports. The transport layer typically sets the source port value on the connection to a random ephemeral port. The server already ignores it for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore client's source port on RDMA transports"). The Linux NFS server's DRC resolves XID collisions from the same source IP address by using the checksum of the first 200 bytes of the RPC call header. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* / xprtrdma: Use struct_size() in kzalloc()Gustavo A. R. Silva2019-05-281-2/+1Star
|/ | | | | | | | | | | | | | | | | | | | | | | | One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Remove stale commentChuck Lever2019-04-251-7/+0Star
| | | | | | | The comment hasn't been accurate for several years. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Update comments that reference ib_drain_qpChuck Lever2019-04-251-4/+6
| | | | | | | | | Commit e1ede312f17e ("xprtrdma: Fix helper that drains the transport") replaced the ib_drain_qp() call, so update documenting comments to reflect current operation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Remove pr_err() call sites from completion handlersChuck Lever2019-04-252-28/+4Star
| | | | | | | Clean up: rely on the trace points instead. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Eliminate struct rpcrdma_create_data_internalChuck Lever2019-04-254-71/+46Star
| | | | | | | | | | Clean up. Move the remaining field in rpcrdma_create_data_internal so the structure can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Aggregate the inline settings in struct rpcrdma_epChuck Lever2019-04-255-36/+41
| | | | | | | | | | | | | | | | Clean up. The inline settings are actually a characteristic of the endpoint, and not related to the device. They are also modified after the transport instance is created, so they do not belong in the cdata structure either. Lastly, let's use names that are more natural to RDMA than to NFS: inline_write -> inline_send and inline_read -> inline_recv. The /proc files retain their names to avoid breaking user space. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Remove rpcrdma_create_data_internal::rsize and wsizeChuck Lever2019-04-252-11/+0Star
| | | | | | | | | | | | | | | Clean up. xprt_rdma_max_inline_{read,write} cannot be set to large values by virtue of proc_dointvec_minmax. The current maximum is RPCRDMA_MAX_INLINE, which is much smaller than RPCRDMA_MAX_SEGS * PAGE_SIZE. The .rsize and .wsize fields are otherwise unused in the current code base, and thus can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Eliminate rpcrdma_ia::ri_deviceChuck Lever2019-04-253-28/+25Star
| | | | | | | | | | | Clean up. Since commit 54cbd6b0c6b9 ("xprtrdma: Delay DMA mapping Send and Receive buffers"), a pointer to the device is now saved in each regbuf when it is DMA mapped. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: More Send completion batchingChuck Lever2019-04-252-13/+1Star
| | | | | | | | | | | | | | Instead of using a fixed number, allow the amount of Send completion batching to vary based on the client's maximum credit limit. - A larger default gives a small boost to IOPS throughput - Reducing it based on max_requests gives a safe result when the max credit limit is cranked down (eg. when the device has a small max_qp_wr). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Clean up sendctx functionsChuck Lever2019-04-253-26/+23Star
| | | | | | | | | | | | Minor clean-ups I've stumbled on since sendctx was merged last year. In particular, making Send completion processing more efficient appears to have a measurable impact on IOPS throughput. Note: test_and_clear_bit() returns a value, thus an explicit memory barrier is not necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Trace marshaling failuresChuck Lever2019-04-251-0/+1
| | | | | | | | Record an event when rpcrdma_marshal_req returns a non-zero return value to help track down why an xprt close might have occurred. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Increase maximum number of backchannel requestsChuck Lever2019-04-251-3/+5
| | | | | | | | Reflects the change introduced in commit 067c46967160 ("NFSv4.1: Bump the default callback session slot count to 16"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Backchannel can use GFP_KERNEL allocationsChuck Lever2019-04-251-64/+40Star
| | | | | | | | | | | The Receive handler runs in process context, thus can use on-demand GFP_KERNEL allocations instead of pre-allocation. This makes the xprtrdma backchannel independent of the number of backchannel session slots provisioned by the Upper Layer protocol. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Clean up regbuf helpersChuck Lever2019-04-253-82/+80Star
| | | | | | | | | | | For code legibility, clean up the function names to be consistent with the pattern: "rpcrdma" _ object-type _ action Also rpcrdma_regbuf_alloc and rpcrdma_regbuf_free no longer have any callers outside of verbs.c, and can thus be made static. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: De-duplicate "allocate new, free old regbuf"Chuck Lever2019-04-253-47/+39Star
| | | | | | | | | | | Clean up by providing an API to do this common task. At this point, the difference between rpcrdma_get_sendbuf and rpcrdma_get_recvbuf has become tiny. These can be collapsed into a single helper. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Allocate req's regbufs at xprt create timeChuck Lever2019-04-254-25/+33
| | | | | | | | | | | | | | | | | Allocating an rpcrdma_req's regbufs at xprt create time enables a pair of micro-optimizations: First, if these regbufs are always there, we can eliminate two conditional branches from the hot xprt_rdma_allocate path. Second, by allocating a 1KB buffer, it places a lower bound on the size of these buffers, without adding yet another conditional branch. The lower bound reduces the number of hardway re- allocations. In fact, for some workloads it completely eliminates hardway allocations. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: rpcrdma_regbuf alignmentChuck Lever2019-04-255-29/+35
| | | | | | | | | | Allocate the struct rpcrdma_regbuf separately from the I/O buffer to better guarantee the alignment of the I/O buffer and eliminate the wasted space between the rpcrdma_regbuf metadata and the buffer itself. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Clean up rpcrdma_create_rep() and rpcrdma_destroy_rep()Chuck Lever2019-04-251-16/+9Star
| | | | | | | | For code legibility, clean up the function names to be consistent with the pattern: "rpcrdma" _ object-type _ action Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Clean up rpcrdma_create_req()Chuck Lever2019-04-253-17/+21
| | | | | | | | | | | | | | | Eventually, I'd like to invoke rpcrdma_create_req() during the call_reserve step. Memory allocation there probably needs to use GFP_NOIO. Therefore a set of GFP flags needs to be passed in. As an additional clean up, just return a pointer or NULL, because the only error return code here is -ENOMEM. Lastly, clean up the function names to be consistent with the pattern: "rpcrdma" _ object-type _ action Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Fix an frwr_map recovery nitChuck Lever2019-04-251-1/+1
| | | | | | | | | After a DMA map failure in frwr_map, mark the MR so that recycling won't attempt to DMA unmap it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Fixes: e2f34e26710b ("xprtrdma: Yet another double DMA-unmap") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* SUNRPC: Avoid digging into the ATOMIC poolChuck Lever2019-04-251-1/+1
| | | | | | | | | Page allocation requests made when the SPARSE_PAGES flag is set are allowed to fail, and are not critical. No need to spend a rare resource. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* SUNRPC: Refactor xprt_request_wait_receive()Trond Myklebust2019-04-252-2/+2
| | | | | | | | | Convert the transport callback to actually put the request to sleep instead of just setting a timeout. This is in preparation for rpc_sleep_on_timeout(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Fix helper that drains the transportChuck Lever2019-04-111-1/+1
| | | | | | | | | | We want to drain only the RQ first. Otherwise the transport can deadlock on ->close if there are outstanding Send completions. Fixes: 6d2d0ee27c7a ("xprtrdma: Replace rpcrdma_receive_wq ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # v5.0+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* Merge tag 'nfsd-5.1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2019-03-124-32/+10Star
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS server updates from Bruce Fields: "Miscellaneous NFS server fixes. Probably the most visible bug is one that could artificially limit NFSv4.1 performance by limiting the number of oustanding rpcs from a single client. Neil Brown also gets a special mention for fixing a 14.5-year-old memory-corruption bug in the encoding of NFSv3 readdir responses" * tag 'nfsd-5.1' of git://linux-nfs.org/~bfields/linux: nfsd: allow nfsv3 readdir request to be larger. nfsd: fix wrong check in write_v4_end_grace() nfsd: fix memory corruption caused by readdir nfsd: fix performance-limiting session calculation svcrpc: fix UDP on servers with lots of threads svcrdma: Remove syslog warnings in work completion handlers svcrdma: Squelch compiler warning when SUNRPC_DEBUG is disabled svcrdma: Use struct_size() in kmalloc() svcrpc: fix unlikely races preventing queueing of sockets svcrpc: svc_xprt_has_something_to_do seems a little long SUNRPC: Don't allow compiler optimisation of svc_xprt_release_slot() nfsd: fix an IS_ERR() vs NULL check
| * svcrdma: Remove syslog warnings in work completion handlersChuck Lever2019-02-064-27/+2Star
| | | | | | | | | | | | | | | | | | These can result in a lot of log noise, and are able to be triggered by client misbehavior. Since there are trace points in these handlers now, there's no need to spam the log. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Squelch compiler warning when SUNRPC_DEBUG is disabledChuck Lever2019-02-061-1/+3
| | | | | | | | | | | | | | | | | | | | | | CC [M] net/sunrpc/xprtrdma/svc_rdma_transport.o linux/net/sunrpc/xprtrdma/svc_rdma_transport.c: In function ‘svc_rdma_accept’: linux/net/sunrpc/xprtrdma/svc_rdma_transport.c:452:19: warning: variable ‘sap’ set but not used [-Wunused-but-set-variable] struct sockaddr *sap; ^ Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Use struct_size() in kmalloc()Gustavo A. R. Silva2019-02-061-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kmalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>