summaryrefslogtreecommitdiffstats
path: root/net/sunrpc/xprtrdma/svc_rdma_transport.c
Commit message (Collapse)AuthorAgeFilesLines
* SUNRPC: Remove the bh-safe lock requirement on xprt->transport_lockTrond Myklebust2019-07-061-4/+4
| | | | Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* svcrdma: Ignore source port when computing DRC hashChuck Lever2019-06-191-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | The DRC appears to be effectively empty after an RPC/RDMA transport reconnect. The problem is that each connection uses a different source port, which defeats the DRC hash. Clients always have to disconnect before they send retransmissions to reset the connection's credit accounting, thus every retransmit on NFS/RDMA will miss the DRC. An NFS/RDMA client's IP source port is meaningless for RDMA transports. The transport layer typically sets the source port value on the connection to a random ephemeral port. The server already ignores it for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore client's source port on RDMA transports"). The Linux NFS server's DRC resolves XID collisions from the same source IP address by using the checksum of the first 200 bytes of the RPC call header. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Remove syslog warnings in work completion handlersChuck Lever2019-02-061-5/+0Star
| | | | | | | | | These can result in a lot of log noise, and are able to be triggered by client misbehavior. Since there are trace points in these handlers now, there's no need to spam the log. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Squelch compiler warning when SUNRPC_DEBUG is disabledChuck Lever2019-02-061-1/+3
| | | | | | | | | | | CC [M] net/sunrpc/xprtrdma/svc_rdma_transport.o linux/net/sunrpc/xprtrdma/svc_rdma_transport.c: In function ‘svc_rdma_accept’: linux/net/sunrpc/xprtrdma/svc_rdma_transport.c:452:19: warning: variable ‘sap’ set but not used [-Wunused-but-set-variable] struct sockaddr *sap; ^ Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Remove max_sge check at connect timeChuck Lever2019-02-061-6/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Two and a half years ago, the client was changed to use gathered Send for larger inline messages, in commit 655fec6987b ("xprtrdma: Use gathered Send for large inline messages"). Several fixes were required because there are a few in-kernel device drivers whose max_sge is 3, and these were broken by the change. Apparently my memory is going, because some time later, I submitted commit 25fd86eca11c ("svcrdma: Don't overrun the SGE array in svc_rdma_send_ctxt"), and after that, commit f3c1fd0ee294 ("svcrdma: Reduce max_send_sges"). These too incorrectly assumed in-kernel device drivers would have more than a few Send SGEs available. The fix for the server side is not the same. This is because the fundamental problem on the server is that, whether or not the client has provisioned a chunk for the RPC reply, the server must squeeze even the most complex RPC replies into a single RDMA Send. Failing in the send path because of Send SGE exhaustion should never be an option. Therefore, instead of failing when the send path runs out of SGEs, switch to using a bounce buffer mechanism to handle RPC replies that are too complex for the device to send directly. That allows us to remove the max_sge check to enable drivers with small max_sge to work again. Reported-by: Don Dutile <ddutile@redhat.com> Fixes: 25fd86eca11c ("svcrdma: Don't overrun the SGE array in ...") Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* sunrpc: remove unused xpo_prep_reply_hdr callbackVasily Averin2018-12-281-1/+0Star
| | | | | | | | | | | xpo_prep_reply_hdr are not used now. It was defined for tcp transport only, however it cannot be called indirectly, so let's move it to its caller and remove unused callback. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* sunrpc: remove svc_rdma_bc_classVasily Averin2018-12-281-57/+0Star
| | | | | | | Remove svc_xprt_class svc_rdma_bc_class and related functions. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* sunrpc: replace svc_serv->sv_bc_xprt by boolean flagVasily Averin2018-12-281-1/+0Star
| | | | | | | | svc_serv-> sv_bc_xprt is netns-unsafe and cannot be used as pointer. To prevent its misuse in future it is replaced by new boolean flag. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Reduce max_send_sgesChuck Lever2018-10-291-4/+6
| | | | | | | | | There's no need to request a large number of send SGEs because the inline threshold already constrains the number of SGEs per Send. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* Merge tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2018-08-241-2/+1Star
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Bruce Fields: "Chuck Lever fixed a problem with NFSv4.0 callbacks over GSS from multi-homed servers. The only new feature is a minor bit of protocol (change_attr_type) which the client doesn't even use yet. Other than that, various bugfixes and cleanup" * tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux: (27 commits) sunrpc: Add comment defining gssd upcall API keywords nfsd: Remove callback_cred nfsd: Use correct credential for NFSv4.0 callback with GSS sunrpc: Extract target name into svc_cred sunrpc: Enable the kernel to specify the hostname part of service principals sunrpc: Don't use stack buffer with scatterlist rpc: remove unneeded variable 'ret' in rdma_listen_handler nfsd: use true and false for boolean values nfsd: constify write_op[] fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id NFSD: Handle full-length symlinks NFSD: Refactor the generic write vector fill helper svcrdma: Clean up Read chunk path svcrdma: Avoid releasing a page in svc_xprt_release() nfsd: Mark expected switch fall-through sunrpc: remove redundant variables 'checksumlen','blocksize' and 'data' nfsd: fix leaked file lock with nfs exported overlayfs nfsd: don't advertise a SCSI layout for an unsupported request_queue nfsd: fix corrupted reply to badly ordered compound nfsd: clarify check_op_ordering ...
| * rpc: remove unneeded variable 'ret' in rdma_listen_handlerzhong jiang2018-08-091-2/+1Star
| | | | | | | | | | | | | | | | The ret is not modified after initalization, So just remove the variable and return 0. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* | IB/core: add max_send_sge and max_recv_sge attributesSteve Wise2018-06-181-1/+1
|/ | | | | | | | | | | | | | | | | | | This patch replaces the ib_device_attr.max_sge with max_send_sge and max_recv_sge. It allows ulps to take advantage of devices that have very different send and recv sge depths. For example cxgb4 has a max_recv_sge of 4, yet a max_send_sge of 16. Splitting out these attributes allows much more efficient use of the SQ for cxgb4 with ulps that use the RDMA_RW API. Consider a large RDMA WRITE that has 16 scattergather entries. With max_sge of 4, the ulp would send 4 WRITE WRs, but with max_sge of 16, it can be done with 1 WRITE WR. Acked-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Selvin Xavier <selvin.xavier@broadcom.com> Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* svcrdma: Don't overrun the SGE array in svc_rdma_send_ctxtChuck Lever2018-05-111-4/+9
| | | | | | | | | | | | | | | | | | | | | Receive buffers are always the same size, but each Send WR has a variable number of SGEs, based on the contents of the xdr_buf being sent. While assembling a Send WR, keep track of the number of SGEs so that we don't exceed the device's maximum, or walk off the end of the Send SGE array. For now the Send path just fails if it exceeds the maximum. The current logic in svc_rdma_accept bases the maximum number of Send SGEs on the largest NFS request that can be sent or received. In the transport layer, the limit is actually based on the capabilities of the underlying device, not on properties of the Upper Layer Protocol. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Introduce svc_rdma_send_ctxtChuck Lever2018-05-111-201/+4Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt free list. This eliminates the overhead of calling kmalloc / kfree, both of which grab a globally shared lock that disables interrupts. Introduce a replacement to svc_rdma_op_ctxt's that is built especially for the svcrdma Send path. Subsequent patches will take advantage of this new structure by allocating real resources which are then cached in these objects. The allocations are freed when the transport is torn down. I've renamed the structure so that static type checking can be used to ensure that uses of op_ctxt and send_ctxt are not confused. As an additional clean up, structure fields are renamed to conform with kernel coding conventions. Additional clean ups: - Handle svc_rdma_send_ctxt_get allocation failure at each call site, rather than pre-allocating and hoping we guessed correctly - All send_ctxt_put call-sites request page freeing, so remove the @free_pages argument - All send_ctxt_put call-sites unmap SGEs, so fold that into svc_rdma_send_ctxt_put Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Persistently allocate and DMA-map Receive buffersChuck Lever2018-05-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current Receive path uses an array of pages which are allocated and DMA mapped when each Receive WR is posted, and then handed off to the upper layer in rqstp::rq_arg. The page flip releases unused pages in the rq_pages pagelist. This mechanism introduces a significant amount of overhead. So instead, kmalloc the Receive buffer, and leave it DMA-mapped while the transport remains connected. This confers a number of benefits: * Each Receive WR requires only one receive SGE, no matter how large the inline threshold is. This helps the server-side NFS/RDMA transport operate on less capable RDMA devices. * The Receive buffer is left allocated and mapped all the time. This relieves svc_rdma_post_recv from the overhead of allocating and DMA-mapping a fresh buffer. * svc_rdma_wc_receive no longer has to DMA unmap the Receive buffer. It has to DMA sync only the number of bytes that were received. * svc_rdma_build_arg_xdr no longer has to free a page in rq_pages for each page in the Receive buffer, making it a constant-time function. * The Receive buffer is now plugged directly into the rq_arg's head[0].iov_vec, and can be larger than a page without spilling over into rq_arg's page list. This enables simplification of the RDMA Read path in subsequent patches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Remove sc_rq_depthChuck Lever2018-05-111-9/+8Star
| | | | | | | | Clean up: No need to retain rq_depth in struct svcrdma_xprt, it is used only in svc_rdma_accept(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Introduce svc_rdma_recv_ctxtChuck Lever2018-05-111-134/+8Star
| | | | | | | | | | | | | | | | | | | | | | | svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt free list. This eliminates the overhead of calling kmalloc / kfree, both of which grab a globally shared lock that disables interrupts. To reduce contention further, separate the use of these objects in the Receive and Send paths in svcrdma. Subsequent patches will take advantage of this separation by allocating real resources which are then cached in these objects. The allocations are freed when the transport is torn down. I've renamed the structure so that static type checking can be used to ensure that uses of op_ctxt and recv_ctxt are not confused. As an additional clean up, structure fields are renamed to conform with kernel coding conventions. As a final clean up, helpers related to recv_ctxt are moved closer to the functions that use them. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Trace key RDMA API eventsChuck Lever2018-05-111-41/+26Star
| | | | | | | | | | | This includes: * Posting on the Send and Receive queues * Send, Receive, Read, and Write completion * Connect upcalls * QP errors Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Trace key RPC/RDMA protocol eventsChuck Lever2018-05-111-7/+12
| | | | | | | | | | | | | | This includes: * Transport accept and tear-down * Decisions about using Write and Reply chunks * Each RDMA segment that is handled * Whenever an RDMA_ERR is sent As a clean-up, I've standardized the order of the includes, and removed some now redundant dprintk call sites. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Use passed-in net namespace when creating RDMA listenerChuck Lever2018-05-111-18/+17Star
| | | | | | | | | | Ensure each RDMA listener and its children transports are created in the same net namespace as the user that started the NFS service. This is similar to how listener sockets are created in svc_create_socket, required for enabling support for containers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Add proper SPDX tags for NetApp-contributed sourceChuck Lever2018-05-111-0/+1
| | | | | Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* sunrpc: Save remote presentation address in svc_xprt for trace eventsChuck Lever2018-04-031-1/+3
| | | | | | | | | | | | | | | | | | | | TP_printk defines a format string that is passed to user space for converting raw trace event records to something human-readable. My user space's printf (Oracle Linux 7), however, does not have a %pI format specifier. The result is that what is supposed to be an IP address in the output of "trace-cmd report" is just a string that says the field couldn't be displayed. To fix this, adopt the same approach as the client: maintain a pre- formated presentation address for occasions when %pI is not available. The location of the trace_svc_send trace point is adjusted so that rqst->rq_xprt is not NULL when the trace event is recorded. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svc: Simplify ->xpo_secure_portChuck Lever2018-04-031-3/+3
| | | | | | | | Clean up: Instead of returning a value that is used to set or clear a bit, just make ->xpo_secure_port mangle that bit, and return void. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Consult max_qp_init_rd_atom when accepting connectionsChuck Lever2018-03-201-13/+9Star
| | | | | | | | | | | | | | | | | | The target needs to return the lesser of the client's Inbound RDMA Read Queue Depth (IRD), provided in the connection parameters, and the local device's Outbound RDMA Read Queue Depth (ORD). The latter limit is max_qp_init_rd_atom, not max_qp_rd_atom. The svcrdma_ord value caps the ORD value for iWARP transports, which do not exchange ORD/IRD values at connection time. Since no other Linux kernel RDMA-enabled storage target sees fit to provide this cap, I'm removing it here too. initiator_depth is a u8, so ensure the computed ORD value does not overflow that field. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Use pr_err to report Receive errorsChuck Lever2018-03-201-3/+3
| | | | | | | Clean up: Other completion handlers use pr_err, not pr_warn. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Post Receives in the Receive completion handlerChuck Lever2018-01-181-18/+7Star
| | | | | | | | | This change improves Receive efficiency by posting Receives only on the same CPU that handles Receive completion. Improved latency and throughput has been noted with this change. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Enqueue after setting XPT_CLOSE in completion handlersChuck Lever2017-11-071-3/+8
| | | | | | | | | | | | | | | | I noticed the server was sometimes not closing the connection after a flushed Send. For example, if the client responds with an RNR NAK to a Reply from the server, that client might be deadlocked, and thus wouldn't send any more traffic. Thus the server wouldn't have any opportunity to notice the XPT_CLOSE bit has been set. Enqueue the transport so that svcxprt notices the bit even if there is no more transport activity after a flushed completion, QP access error, or device removal event. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-By: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Estimate Send Queue depth properlyChuck Lever2017-09-051-4/+13
| | | | | | | | | | | | | | | | | | | | | The rdma_rw API adjusts max_send_wr upwards during the rdma_create_qp() call. If the ULP actually wants to take advantage of these extra resources, it must increase the size of its send completion queue (created before rdma_create_qp is called) and increase its send queue accounting limit. Use the new rdma_rw_mr_factor API to figure out the correct value to use for the Send Queue and Send Completion Queue depths. And, ensure that the chosen Send Queue depth for a newly created transport does not overrun the QP WR limit of the underlying device. Lastly, there's no longer a need to carry the Send Queue depth in struct svcxprt_rdma, since the value is used only in the svc_rdma_accept() path. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Limit RQ depthChuck Lever2017-09-051-7/+12
| | | | | | | | Ensure that the chosen Receive Queue depth for a newly created transport does not overrun the QP WR limit of the underlying device. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* sunrpc: Const-ify instances of struct svc_xprt_opsChuck Lever2017-08-251-2/+2
| | | | | | | | Close an attack vector by moving the arrays of server-side transport methods to read-only memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Clean up after converting svc_rdma_recvfrom to rdma_rw APIChuck Lever2017-07-121-35/+4Star
| | | | | | | | Clean up: Registration mode details are now handled by the rdma_rw API, and thus can be removed from svcrdma. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Clean-up svc_rdma_unmap_dmaChuck Lever2017-07-121-14/+5Star
| | | | | | | | | There's no longer a need to compare each SGE's lkey with the PD's local_dma_lkey. Now that FRWR is gone, all DMA mappings are for pages that were registered with this key. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Remove frmr cacheChuck Lever2017-07-121-86/+0Star
| | | | | | | | | | Clean up: Now that the svc_rdma_recvfrom path uses the rdma_rw API, the details of Read sink buffer registration are dealt with by the kernel's RDMA core. This cache is no longer used, and can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Remove unused Read completion handlersChuck Lever2017-07-121-84/+9Star
| | | | | | | | | | | | | | Clean up: The generic RDMA R/W API conversion of svc_rdma_recvfrom replaced the Register, Read, and Invalidate completion handlers. Remove the old ones, which are no longer used. These handlers shared some helper code with svc_rdma_wc_send. Fold the wc_common helper back into the one remaining completion handler. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Use generic RDMA R/W API in RPC Call pathChuck Lever2017-07-121-13/+0Star
| | | | | | | | | | | | | | | | The current svcrdma recvfrom code path has a lot of detail about registration mode and the type of port (iWARP, IB, etc). Instead, use the RDMA core's generic R/W API. This shares code with other RDMA-enabled ULPs that manages the gory details of buffer registration and the posting of RDMA Read Work Requests. Since the Read list marshaling code is being replaced, I took the opportunity to replace C structure-based XDR encoding code with more portable code that uses pointer arithmetic. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Remove the req_map cacheChuck Lever2017-04-251-84/+0Star
| | | | | | | | req_maps are no longer used by the send path and can thus be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Remove unused RDMA Write completion handlerChuck Lever2017-04-251-18/+0Star
| | | | | | | | Clean up. All RDMA Write completions are now handled by svc_rdma_wc_write_ctx. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Use rdma_rw API in RPC reply pathChuck Lever2017-04-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current svcrdma sendto code path posts one RDMA Write WR at a time. Each of these Writes typically carries a small number of pages (for instance, up to 30 pages for mlx4 devices). That means a 1MB NFS READ reply requires 9 ib_post_send() calls for the Write WRs, and one for the Send WR carrying the actual RPC Reply message. Instead, use the new rdma_rw API. The details of Write WR chain construction and memory registration are taken care of in the RDMA core. svcrdma can focus on the details of the RPC-over-RDMA protocol. This gives three main benefits: 1. All Write WRs for one RDMA segment are posted in a single chain. As few as one ib_post_send() for each Write chunk. 2. The Write path can now use FRWR to register the Write buffers. If the device's maximum page list depth is large, this means a single Write WR is needed for each RPC's Write chunk data. 3. The new code introduces support for RPCs that carry both a Write list and a Reply chunk. This combination can be used for an NFSv4 READ where the data payload is large, and thus is removed from the Payload Stream, but the Payload Stream is still larger than the inline threshold. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Introduce local rdma_rw API helpersChuck Lever2017-04-251-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The plan is to replace the local bespoke code that constructs and posts RDMA Read and Write Work Requests with calls to the rdma_rw API. This shares code with other RDMA-enabled ULPs that manages the gory details of buffer registration and posting Work Requests. Some design notes: o The structure of RPC-over-RDMA transport headers is flexible, allowing multiple segments per Reply with arbitrary alignment, each with a unique R_key. Write and Send WRs continue to be built and posted in separate code paths. However, one whole chunk (with one or more RDMA segments apiece) gets exactly one ib_post_send and one work completion. o svc_xprt reference counting is modified, since a chain of rdma_rw_ctx structs generates one completion, no matter how many Write WRs are posted. o The current code builds the transport header as it is construct- ing Write WRs. I've replaced that with marshaling of transport header data items in a separate step. This is because the exact structure of client-provided segments may not align with the components of the server's reply xdr_buf, or the pages in the page list. Thus parts of each client-provided segment may be written at different points in the send path. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULTChuck Lever2017-04-251-1/+1
| | | | | | | | | | | | | | | | The Send Queue depth is temporarily reduced to 1 SQE per credit. The new rdma_rw API does an internal computation, during QP creation, to increase the depth of the Send Queue to handle RDMA Read and Write operations. This change has to come before the NFSD code paths are updated to use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases the size of the SQ too much, resulting in memory allocation failures during QP creation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* svcrdma: set XPT_CONG_CTRL flag for bc xprtChuck Lever2017-03-291-0/+1
| | | | | | | | Same change as Kinglong Mee's fix for the TCP backchannel service. Fixes: 5283b03ee5cd ("nfs/nfsd/sunrpc: enforce transport...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* Merge tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2017-03-011-32/+37
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Bruce Fields: "The nfsd update this round is mainly a lot of miscellaneous cleanups and bugfixes. A couple changes could theoretically break working setups on upgrade. I don't expect complaints in practice, but they seem worth calling out just in case: - NFS security labels are now off by default; a new security_label export flag reenables it per export. But, having them on by default is a disaster, as it generally only makes sense if all your clients and servers have similar enough selinux policies. Thanks to Jason Tibbitts for pointing this out. - NFSv4/UDP support is off. It was never really supported, and the spec explicitly forbids it. We only ever left it on out of laziness; thanks to Jeff Layton for finally fixing that" * tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linux: (34 commits) nfsd: Fix display of the version string nfsd: fix configuration of supported minor versions sunrpc: don't register UDP port with rpcbind when version needs congestion control nfs/nfsd/sunrpc: enforce transport requirements for NFSv4 sunrpc: flag transports as having congestion control sunrpc: turn bitfield flags in svc_version into bools nfsd: remove superfluous KERN_INFO nfsd: special case truncates some more nfsd: minor nfsd_setattr cleanup NFSD: Reserve adequate space for LOCKT operation NFSD: Get response size before operation for all RPCs nfsd/callback: Drop a useless data copy when comparing sessionid nfsd/callback: skip the callback tag nfsd/callback: Cleanup callback cred on shutdown nfsd/idmap: return nfserr_inval for 0-length names SUNRPC/Cache: Always treat the invalid cache as unexpired SUNRPC: Drop all entries from cache_detail when cache_purge() svcrdma: Poll CQs in "workqueue" mode svcrdma: Combine list fields in struct svc_rdma_op_ctxt svcrdma: Remove unused sc_dto_q field ...
| * sunrpc: flag transports as having congestion controlJeff Layton2017-02-241-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | NFSv4 requires a transport protocol with congestion control in most cases. On an IP network, that means that NFSv4 over UDP should be forbidden. The situation with RDMA is a bit more nuanced, but most RDMA transports are suitable for this. For now, we assume that all RDMA transports are suitable, but we may need to revise that at some point. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Poll CQs in "workqueue" modeChuck Lever2017-02-081-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | svcrdma calls svc_xprt_put() in its completion handlers, which currently run in IRQ context. However, svc_xprt_put() is meant to be invoked in process context, not in IRQ context. After the last transport reference is gone, it directly calls a transport release function that expects to run in process context. Change the CQ polling modes to IB_POLL_WORKQUEUE so that svcrdma invokes svc_xprt_put() only in process context. As an added benefit, bottom half-disabled spin locking can be eliminated from I/O paths. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Combine list fields in struct svc_rdma_op_ctxtChuck Lever2017-02-081-18/+15Star
| | | | | | | | | | | | | | | | | | | | Clean up: The free list and the dto_q list fields are never used at the same time. Reduce the size of struct svc_rdma_op_ctxt by combining these fields. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Remove unused sc_dto_q fieldChuck Lever2017-02-081-1/+0Star
| | | | | | | | | | | | | | | | | | | | | | Clean up. Commit be99bb11400c ("svcrdma: Use new CQ API for RPC-over-RDMA server send CQs") removed code that used the sc_dto_q field, but neglected to remove sc_dto_q at the same time. Fixes: be99bb11400c ("svcrdma: Use new CQ API for RPC-over- ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Clean up RPC-over-RDMA Reply header encoderChuck Lever2017-02-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace C structure-based XDR decoding with pointer arithmetic. Pointer arithmetic is considered more portable, and is used throughout the kernel's existing XDR encoders. The gcc optimizer generates similar assembler code either way. Byte-swapping before a memory store on x86 typically results in an instruction pipeline stall. Avoid byte-swapping when encoding a new header. svcrdma currently doesn't alter a connection's credit grant value after the connection has been accepted, so it is effectively a constant. Cache the byte-swapped value in a separate field. Christoph suggested pulling the header encoding logic into the only function that uses it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* | locking/atomic, kref: Add kref_read()Peter Zijlstra2017-01-141-2/+2
|/ | | | | | | | | | | | | | | | | | | | | | Since we need to change the implementation, stop exposing internals. Provide kref_read() to read the current reference count; typically used for debug messages. Kills two anti-patterns: atomic_read(&kref->refcount) kref->refcount.counter Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2016-12-161-63/+31Star
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Bruce Fields: "The one new feature is support for a new NFSv4.2 mode_umask attribute that makes ACL inheritance a little more useful in environments that default to restrictive umasks. Requires client-side support, also on its way for 4.10. Other than that, miscellaneous smaller fixes and cleanup, especially to the server rdma code" [ The client side of the umask attribute was merged yesterday ] * tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux: nfsd: add support for the umask attribute sunrpc: use DEFINE_SPINLOCK() svcrdma: Further clean-up of svc_rdma_get_inv_rkey() svcrdma: Break up dprintk format in svc_rdma_accept() svcrdma: Remove unused variable in rdma_copy_tail() svcrdma: Remove unused variables in xprt_rdma_bc_allocate() svcrdma: Remove svc_rdma_op_ctxt::wc_status svcrdma: Remove DMA map accounting svcrdma: Remove BH-disabled spin locking in svc_rdma_send() svcrdma: Renovate sendto chunk list parsing svcauth_gss: Close connection when dropping an incoming message svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm nfsd: constify reply_cache_stats_operations structure nfsd: update workqueue creation sunrpc: GFP_KERNEL should be GFP_NOFS in crypto code nfsd: catch errors in decode_fattr earlier nfsd: clean up supported attribute handling nfsd: fix error handling for clients that fail to return the layout nfsd: more robust allocation failure handling in nfsd_reply_cache_init
| * svcrdma: Break up dprintk format in svc_rdma_accept()Chuck Lever2016-11-301-37/+18Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current code results in: Nov 7 14:50:19 klimt kernel: svcrdma: newxprt->sc_cm_id=ffff88085590c800, newxprt->sc_pd=ffff880852a7ce00#012 cm_id->device=ffff88084dd20000, sc_pd->device=ffff88084dd20000#012 cap.max_send_wr = 272#012 cap.max_recv_wr = 34#012 cap.max_send_sge = 32#012 cap.max_recv_sge = 32 Nov 7 14:50:19 klimt kernel: svcrdma: new connection ffff880855908000 accepted with the following attributes:#012 local_ip : 10.0.0.5#012 local_port#011 : 20049#012 remote_ip : 10.0.0.2#012 remote_port : 59909#012 max_sge : 32#012 max_sge_rd : 30#012 sq_depth : 272#012 max_requests : 32#012 ord : 16 Split up the output over multiple dprintks and take the opportunity to fix the display of IPv6 addresses. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>