summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
...
| | * | | | | | svcrdma: use offset_in_page() macroChuck Lever2017-07-121-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: Use offset_in_page() macro instead of open-coding. Reported-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Clean up after converting svc_rdma_recvfrom to rdma_rw APIChuck Lever2017-07-121-35/+4Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: Registration mode details are now handled by the rdma_rw API, and thus can be removed from svcrdma. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Clean-up svc_rdma_unmap_dmaChuck Lever2017-07-121-14/+5Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's no longer a need to compare each SGE's lkey with the PD's local_dma_lkey. Now that FRWR is gone, all DMA mappings are for pages that were registered with this key. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Remove frmr cacheChuck Lever2017-07-121-86/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: Now that the svc_rdma_recvfrom path uses the rdma_rw API, the details of Read sink buffer registration are dealt with by the kernel's RDMA core. This cache is no longer used, and can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Remove unused Read completion handlersChuck Lever2017-07-121-84/+9Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: The generic RDMA R/W API conversion of svc_rdma_recvfrom replaced the Register, Read, and Invalidate completion handlers. Remove the old ones, which are no longer used. These handlers shared some helper code with svc_rdma_wc_send. Fold the wc_common helper back into the one remaining completion handler. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Properly compute .len and .buflen for received RPC CallsChuck Lever2017-07-122-11/+5Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an RPC-over-RDMA request is received, the Receive buffer contains a Transport Header possibly followed by an RPC message. Even though rq_arg.head[0] (as passed to NFSD) does not contain the Transport Header header, currently rq_arg.len includes the size of the Transport Header. That violates the intent of the xdr_buf API contract. .buflen should include everything, but .len should be exactly the length of the RPC message in the buffer. The rq_arg fields are summed together at the end of svc_rdma_recvfrom to obtain the correct return value. rq_arg.len really ought to contain the correct number of bytes already, but it currently doesn't due to the above misbehavior. Let's instead ensure that .buflen includes the length of the transport header, and that .len is always equal to head.iov_len + .page_len + tail.iov_len . Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Use generic RDMA R/W API in RPC Call pathChuck Lever2017-07-122-454/+106Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current svcrdma recvfrom code path has a lot of detail about registration mode and the type of port (iWARP, IB, etc). Instead, use the RDMA core's generic R/W API. This shares code with other RDMA-enabled ULPs that manages the gory details of buffer registration and the posting of RDMA Read Work Requests. Since the Read list marshaling code is being replaced, I took the opportunity to replace C structure-based XDR encoding code with more portable code that uses pointer arithmetic. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Add recvfrom helpers to svc_rdma_rw.cChuck Lever2017-07-121-1/+424
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | svc_rdma_rw.c already contains helpers for the sendto path. Introduce helpers for the recvfrom path. The plan is to replace the local NFSD bespoke code that constructs and posts RDMA Read Work Requests with calls to the rdma_rw API. This shares code with other RDMA-enabled ULPs that manages the gory details of buffer registration and posting Work Requests. This new code also puts all RDMA_NOMSG-specific logic in one place. Lastly, the use of rqstp->rq_arg.pages is deprecated in favor of using rqstp->rq_pages directly, for clarity. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqstChuck Lever2017-07-121-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | svcrdma needs 259 pages allocated to receive 1MB NFSv4.0 WRITE requests: - 1 page for the transport header and head iovec - 256 pages for the data payload - 1 page for the trailing GETATTR request (since NFSD XDR decoding does not look for a tail iovec, the GETATTR is stuck at the end of the rqstp->rq_arg.pages list) - 1 page for building the reply xdr_buf But RPCSVC_MAXPAGES is already 259 (on x86_64). The problem is that svc_alloc_arg never allocates that many pages. To address this: 1. The final element of rq_pages always points to NULL. To accommodate up to 259 pages in rq_pages, add an extra element to rq_pages for the array termination sentinel. 2. Adjust the calculation of "pages" to match how RPCSVC_MAXPAGES is calculated, so it can go up to 259. Bruce noted that the calculation assumes sv_max_mesg is a multiple of PAGE_SIZE, which might not always be true. I didn't change this assumption. 3. Change the loop boundaries to allow 259 pages to be allocated. Additional clean-up: WARN_ON_ONCE adds an extra conditional branch, which is basically never taken. And there's no need to dump the stack here because svc_alloc_arg has only one caller. Keeping that NULL "array termination sentinel"; there doesn't appear to be any code that depends on it, only code in nfsd_splice_actor() which needs the 259th element to be initialized to *something*. So it's possible we could just keep the array at 259 elements and drop that final NULL, but we're being conservative for now. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Don't account for Receive queue "starvation"Chuck Lever2017-06-281-15/+6Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | >From what I can tell, calling ->recvfrom when there is no work to do is a normal part of operation. This is the only way svc_recv can tell when there is no more data ready to receive on the transport. Neither the TCP nor the UDP transport implementations have a "starve" metric. The cost of receive starvation accounting is bumping an atomic, which results in extra (IMO unnecessary) bus traffic between CPU sockets, while holding a spin lock. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Improve Reply chunk sanity checkingChuck Lever2017-06-281-6/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Identify malformed transport headers and unsupported chunk combinations as early as possible. - Ensure that segment lengths are not crazy. - Ensure that the Reply chunk's segment count is not crazy. With a 1KB inline threshold, the largest number of Write segments that can be conveyed is about 60 (for a RDMA_NOMSG Reply message). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Improve Write chunk sanity checkingChuck Lever2017-06-281-5/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Identify malformed transport headers and unsupported chunk combinations as early as possible. - Reject RPC-over-RDMA messages that contain more than one Write chunk, since this implementation does not support more than one per message. - Ensure that segment lengths are not crazy. - Ensure that the chunk's segment count is not crazy. With a 1KB inline threshold, the largest number of Write segments that can be conveyed is about 60 (for a RDMA_NOMSG Reply message). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Improve Read chunk sanity checkingChuck Lever2017-06-281-18/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Identify malformed transport headers and unsupported chunk combinations as early as possible. - Reject RPC-over-RDMA messages that contain more than one Read chunk, since this implementation currently does not support more than one per RPC transaction. - Ensure that segment lengths are not crazy. - Remove the segment count check. With a 1KB inline threshold, the largest number of Read segments that can be conveyed is about 40 (for a RDMA_NOMSG Call message). This is nowhere near RPCSVC_MAXPAGES. As far as I can tell, that was just a sanity check and does not enforce an implementation limit. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Remove svc_rdma_marshal.cChuck Lever2017-06-283-170/+128Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | svc_rdma_marshal.c has one remaining exported function -- svc_rdma_xdr_decode_req -- and it has a single call site. Take the same approach as the sendto path, and move this function into the source file where it is called. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Avoid Send Queue overflowChuck Lever2017-06-282-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sanity case: Catch the case where more Work Requests are being posted to the Send Queue than there are Send Queue Entries. This might happen if a client sends a chunk with more segments than there are SQEs for the transport. The server can't send that reply, so the transport will deadlock unless the client drops the RPC. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | svcrdma: Squelch disconnection messagesChuck Lever2017-06-281-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The server displays "svcrdma: failed to post Send WR (-107)" in the kernel log when the client disconnects. This could flood the server's log, so remove the message. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | sunrpc: Disable splice for krb5iChuck Lever2017-06-282-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Running a multi-threaded 8KB fio test (70/30 mix), three or four out of twelve of the jobs fail when using krb5i. The failure is an EIO on a read. Troubleshooting confirmed the EIO results when the client fails to verify the MIC of an NFS READ reply. Bruce suggested the problem could be due to the data payload changing between the time the reply's MIC was computed on the server and the time the reply was actually sent. krb5p gets around this problem by disabling RQ_SPLICE_OK. Use the same mechanism for krb5i RPCs. "iozone -i0 -i1 -s128m -y1k -az -I", export is tmpfs, mount is sec=krb5i,vers=3,proto=rdma. The important numbers are the read / reread column. Here's without the RQ_SPLICE_OK patch: kB reclen write rewrite read reread 131072 1 7546 7929 8396 8267 131072 2 14375 14600 15843 15639 131072 4 19280 19248 21303 21410 131072 8 32350 31772 35199 34883 131072 16 36748 37477 49365 51706 131072 32 55669 56059 57475 57389 131072 64 74599 75190 74903 75550 131072 128 99810 101446 102828 102724 131072 256 122042 122612 124806 125026 131072 512 137614 138004 141412 141267 131072 1024 146601 148774 151356 151409 131072 2048 180684 181727 293140 292840 131072 4096 206907 207658 552964 549029 131072 8192 223982 224360 454493 473469 131072 16384 228927 228390 654734 632607 And here's with it: kB reclen write rewrite read reread 131072 1 7700 7365 7958 8011 131072 2 13211 13303 14937 14414 131072 4 19001 19265 20544 20657 131072 8 30883 31097 34255 33566 131072 16 36868 34908 51499 49944 131072 32 56428 55535 58710 56952 131072 64 73507 74676 75619 74378 131072 128 100324 101442 103276 102736 131072 256 122517 122995 124639 124150 131072 512 137317 139007 140530 140830 131072 1024 146807 148923 151246 151072 131072 2048 179656 180732 292631 292034 131072 4096 206216 208583 543355 541951 131072 8192 223738 224273 494201 489372 131072 16384 229313 229840 691719 668427 I would say that there is not much difference in this test. For good measure, here's the same test with sec=krb5p: kB reclen write rewrite read reread 131072 1 5982 5881 6137 6218 131072 2 10216 10252 10850 10932 131072 4 12236 12575 15375 15526 131072 8 15461 15462 23821 22351 131072 16 25677 25811 27529 27640 131072 32 31903 32354 34063 33857 131072 64 42989 43188 45635 45561 131072 128 52848 53210 56144 56141 131072 256 59123 59214 62691 62933 131072 512 63140 63277 66887 67025 131072 1024 65255 65299 69213 69140 131072 2048 76454 76555 133767 133862 131072 4096 84726 84883 251925 250702 131072 8192 89491 89482 270821 276085 131072 16384 91572 91597 361768 336868 BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=307 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| | * | | | | | Merge tag 'v4.12-rc5' into nfsd treeJ. Bruce Fields2017-06-28106-638/+938
| | |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Update to get f0c3192ceee3 "virtio_net: lower limit on buffer size". That bug was interfering with my nfsd testing.
| | * | | | | | sunrpc: mark all struct svc_version instances as constChristoph Hellwig2017-05-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * | | | | | sunrpc: mark all struct svc_procinfo instances as constChristoph Hellwig2017-05-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct svc_procinfo contains function pointers, and marking it as constant avoids it being able to be used as an attach vector for code injections. Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | | | | | sunrpc: move pc_count out of struct svc_procinfoChristoph Hellwig2017-05-152-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pc_count is the only writeable memeber of struct svc_procinfo, which is a good candidate to be const-ified as it contains function pointers. This patch moves it into out out struct svc_procinfo, and into a separate writable array that is pointed to by struct svc_version. Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | | | | | sunrpc: properly type pc_encode callbacksChristoph Hellwig2017-05-151-4/+2Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drop the resp argument as it can trivially be derived from the rqstp argument. With that all functions now have the same prototype, and we can remove the unsafe casting to kxdrproc_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * | | | | | sunrpc: properly type pc_decode callbacksChristoph Hellwig2017-05-151-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drop the argp argument as it can trivially be derived from the rqstp argument. With that all functions now have the same prototype, and we can remove the unsafe casting to kxdrproc_t. Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | | | | | sunrpc: properly type pc_release callbacksChristoph Hellwig2017-05-151-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drop the p and resp arguments as they are always NULL or can trivially be derived from the rqstp argument. With that all functions now have the same prototype, and we can remove the unsafe casting to kxdrproc_t. Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | | | | | sunrpc: properly type pc_func callbacksChristoph Hellwig2017-05-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drop the argp and resp arguments as they can trivially be derived from the rqstp argument. With that all functions now have the same prototype, and we can remove the unsafe casting to svc_procfunc as well as the svc_procfunc typedef itself. Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | | | | | sunrpc: mark all struct rpc_procinfo instances as constChristoph Hellwig2017-05-154-13/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct rpc_procinfo contains function pointers, and marking it as constant avoids it being able to be used as an attach vector for code injections. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * | | | | | sunrpc: move p_count out of struct rpc_procinfoChristoph Hellwig2017-05-154-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | p_count is the only writeable memeber of struct rpc_procinfo, which is a good candidate to be const-ified as it contains function pointers. This patch moves it into out out struct rpc_procinfo, and into a separate writable array that is pointed to by struct rpc_version and indexed by p_statidx. Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | | | | | sunrpc/auth_gss: fix decoder callback prototypesChristoph Hellwig2017-05-153-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Declare the p_decode callbacks with the proper prototype instead of casting to kxdrdproc_t and losing all type safety. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * | | | | | sunrpc: fix decoder callback prototypesChristoph Hellwig2017-05-151-12/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Declare the p_decode callbacks with the proper prototype instead of casting to kxdrdproc_t and losing all type safety. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com>
| | * | | | | | sunrpc: properly type argument to kxdrdproc_tChristoph Hellwig2017-05-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass struct rpc_request as the first argument instead of an untyped blob. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * | | | | | sunrpc/auth_gss: nfsd: fix encoder callback prototypesChristoph Hellwig2017-05-153-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Declare the p_encode callbacks with the proper prototype instead of casting to kxdreproc_t and losing all type safety. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * | | | | | sunrpc: fix encoder callback prototypesChristoph Hellwig2017-05-151-11/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Declare the p_encode callbacks with the proper prototype instead of casting to kxdreproc_t and losing all type safety. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| | * | | | | | sunrpc: properly type argument to kxdreproc_tChristoph Hellwig2017-05-151-1/+2
| | | |/ / / / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass struct rpc_request as the first argument instead of an untyped blob, and mark the data object as const. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com>
| * | | | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2017-07-134-13/+9Star
| |\ \ \ \ \ \ | | |_|_|_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge yet more updates from Andrew Morton: - various misc things - kexec updates - sysctl core updates - scripts/gdb udpates - checkpoint-restart updates - ipc updates - kernel/watchdog updates - Kees's "rough equivalent to the glibc _FORTIFY_SOURCE=1 feature" - "stackprotector: ascii armor the stack canary" - more MM bits - checkpatch updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits) writeback: rework wb_[dec|inc]_stat family of functions ARM: samsung: usb-ohci: move inline before return type video: fbdev: omap: move inline before return type video: fbdev: intelfb: move inline before return type USB: serial: safe_serial: move __inline__ before return type drivers: tty: serial: move inline before return type drivers: s390: move static and inline before return type x86/efi: move asmlinkage before return type sh: move inline before return type MIPS: SMP: move asmlinkage before return type m68k: coldfire: move inline before return type ia64: sn: pci: move inline before type ia64: move inline before return type FRV: tlbflush: move asmlinkage before return type CRIS: gpio: move inline before return type ARM: HP Jornada 7XX: move inline before return type ARM: KVM: move asmlinkage before type checkpatch: improve the STORAGE_CLASS test mm, migration: do not trigger OOM killer when migrating memory drm/i915: use __GFP_RETRY_MAYFAIL ...
| | * | | | | mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful ↵Michal Hocko2017-07-133-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()Michal Hocko2017-07-131-8/+4Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xt_alloc_table_info() basically opencodes kvmalloc() so use the library function instead. Link: http://lkml.kernel.org/r/20170531155145.17111-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | | net: make dev_close and related functions voidstephen hemminger2017-07-201-15/+11Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no useful return value from dev_close. All paths return 0. Change dev_close and helper functions to void. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | bluetooth: 6lowpan dev_close never returns errorstephen hemminger2017-07-201-5/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function dev_close in current kernel will never return an error. Later changes will make it void. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | net: dsa: unexport dsa_is_port_initializedVivien Didelot2017-07-201-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The dsa_is_port_initialized helper is only used by dsa_switch_resume and dsa_switch_suspend, if CONFIG_PM_SLEEP is enabled. Make it static to dsa.c. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | net/packet: remove unused PGV_FROM_VMALLOC definition.Rosen, Rami2017-07-201-2/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes the definition of PGV_FROM_VMALLOC from af_packet.c. The PGV_FROM_VMALLOC definition was already removed by commit 441c793a5650 ("net: cleanup unused macros in net directory"), and its usage was removed even before by commit c56b4d90123b ("af_packet: remove pgv.flags"); but it was added back by mistake later on, in commit f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation"). Signed-off-by: Rami Rosen <rami.rosen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | tcp: adjust tail loss probe timeoutYuchung Cheng2017-07-202-8/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adjusts the timeout formula to schedule the TCP loss probe (TLP). The previous formula uses 2*SRTT or 1.5*RTT + DelayACKMax if only one packet is in flight. It keeps a lower bound of 10 msec which is too large for short RTT connections (e.g. within a data-center). The new formula = 2*RTT + (inflight == 1 ? 200ms : 2ticks) which performs better for short and fast connections. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | openvswitch: Optimize operations for OvS flow_stats.Tonghao Zhang2017-07-193-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When calling the flow_free() to free the flow, we call many times (cpu_possible_mask, eg. 128 as default) cpumask_next(). That will take up our CPU usage if we call the flow_free() frequently. When we put all packets to userspace via upcall, and OvS will send them back via netlink to ovs_packet_cmd_execute(will call flow_free). The test topo is shown as below. VM01 sends TCP packets to VM02, and OvS forward packtets. When testing, we use perf to report the system performance. VM01 --- OvS-VM --- VM02 Without this patch, perf-top show as below: The flow_free() is 3.02% CPU usage. 4.23% [kernel] [k] _raw_spin_unlock_irqrestore 3.62% [kernel] [k] __do_softirq 3.16% [kernel] [k] __memcpy 3.02% [kernel] [k] flow_free 2.42% libc-2.17.so [.] __memcpy_ssse3_back 2.18% [kernel] [k] copy_user_generic_unrolled 2.17% [kernel] [k] find_next_bit When applied this patch, perf-top show as below: Not shown on the list anymore. 4.11% [kernel] [k] _raw_spin_unlock_irqrestore 3.79% [kernel] [k] __do_softirq 3.46% [kernel] [k] __memcpy 2.73% libc-2.17.so [.] __memcpy_ssse3_back 2.25% [kernel] [k] copy_user_generic_unrolled 1.89% libc-2.17.so [.] _int_malloc 1.53% ovs-vswitchd [.] xlate_actions With this patch, the TCP throughput(we dont use Megaflow Cache + Microflow Cache) between VMs is 1.18Gbs/sec up to 1.30Gbs/sec (maybe ~10% performance imporve). This patch adds cpumask struct, the cpu_used_mask stores the cpu_id that the flow used. And we only check the flow_stats on the cpu we used, and it is unncessary to check all possible cpu when getting, cleaning, and updating the flow_stats. Adding the cpu_used_mask to sw_flow struct does’t increase the cacheline number. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | openvswitch: Optimize updating for OvS flow_stats.Tonghao Zhang2017-07-191-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the ovs_flow_stats_update(), we only use the node var to alloc flow_stats struct. But this is not a common case, it is unnecessary to call the numa_node_id() everytime. This patch is not a bugfix, but there maybe a small increase. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | xfrm: add xdst pcpu cacheFlorian Westphal2017-07-183-3/+131
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | retain last used xfrm_dst in a pcpu cache. On next request, reuse this dst if the policies are the same. The cache will not help with strict RR workloads as there is no hit. The cache packet-path part is reasonably small, the notifier part is needed so we do not add long hangs when a device is dismantled but some pcpu xdst still holds a reference, there are also calls to the flush operation when userspace deletes SAs so modules can be removed (there is no hit. We need to run the dst_release on the correct cpu to avoid races with packet path. This is done by adding a work_struct for each cpu and then doing the actual test/release on each affected cpu via schedule_work_on(). Test results using 4 network namespaces and null encryption: ns1 ns2 -> ns3 -> ns4 netperf -> xfrm/null enc -> xfrm/null dec -> netserver what TCP_STREAM UDP_STREAM UDP_RR Flow cache: 14644.61 294.35 327231.64 No flow cache: 14349.81 242.64 202301.72 Pcpu cache: 14629.70 292.21 205595.22 UDP tests used 64byte packets, tests ran for one minute each, value is average over ten iterations. 'Flow cache' is 'net-next', 'No flow cache' is net-next plus this series but without this patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | xfrm: remove flow cacheFlorian Westphal2017-07-188-653/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After rcu conversions performance degradation in forward tests isn't that noticeable anymore. See next patch for some numbers. A followup patcg could then also remove genid from the policies as we do not cache bundles anymore. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | xfrm_policy: make xfrm_bundle_lookup return xfrm dst objectFlorian Westphal2017-07-181-16/+12Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows to remove flow cache object embedded in struct xfrm_dst. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | xfrm_policy: remove xfrm_policy_lookupFlorian Westphal2017-07-181-32/+4Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This removes the wrapper and renames the __xfrm_policy_lookup variant to get rid of another place that used flow cache objects. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | xfrm_policy: kill flow to policy dir conversionFlorian Westphal2017-07-181-42/+4Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFRM_POLICY_IN/OUT/FWD are identical to FLOW_DIR_*, so gcc already removed this function as its just returns the argument. Again, no code change. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | xfrm_policy: remove always true/false branchesFlorian Westphal2017-07-181-60/+14Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | after previous change oldflo and xdst are always NULL. These branches were already removed by gcc, this doesn't change code. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | xfrm_policy: bypass flow_cache_lookupFlorian Westphal2017-07-181-9/+5Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of consulting flow cache, call the xfrm bundle/policy lookup functions directly. This pretends the flow cache had no entry. This helps to gradually remove flow cache integration, followup commit will remove the dead code that this change adds. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>