summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* sctp: not return ENOMEM err back in sctp_packet_transmitXin Long2016-09-191-25/+22Star
| | | | | | | | | | | | | | | | | | | | As David and Marcelo's suggestion, ENOMEM err shouldn't return back to user in transmit path. Instead, sctp's retransmit would take care of the chunks that fail to send because of ENOMEM. This patch is only to do some release job when alloc_skb fails, not to return ENOMEM back any more. Besides, it also cleans up sctp_packet_transmit's err path, and fixes some issues in err path: - It didn't free the head skb in nomem: path. - No need to check nskb in no_route: path. - It should goto err: path if alloc_skb fails for head. - Not all the NOMEMs should free nskb. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: make sctp_outq_flush/tail/uncork return voidXin Long2016-09-192-17/+11Star
| | | | | | | | | sctp_outq_flush return value is meaningless now, this patch is to make sctp_outq_flush return void, as well as sctp_outq_fail and sctp_outq_uncork. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: save transmit error to sk_err in sctp_outq_flushXin Long2016-09-192-10/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Every time when sctp calls sctp_outq_flush, it sends out the chunks of control queue, retransmit queue and data queue. Even if some trunks are failed to transmit, it still has to flush all the transports, as it's the only chance to clean that transmit_list. So the latest transmit error here should be returned back. This transmit error is an internal error of sctp stack. I checked all the places where it uses the transmit error (the return value of sctp_outq_flush), most of them are actually just save it to sk_err. Except for sctp_assoc/endpoint_bh_rcv, they will drop the chunk if it's failed to send a REPLY, which is actually incorrect, as we can't be sure the error that sctp_outq_flush returns is from sending that REPLY. So it's meaningless for sctp_outq_flush to return error back. This patch is to save transmit error to sk_err in sctp_outq_flush, the new error can update the old value. Eventually, sctp_wait_for_* would check for it. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: free msg->chunks when sctp_primitive_SEND return errXin Long2016-09-193-3/+19
| | | | | | | | | | | | | | | | | Last patch "sctp: do not return the transmit err back to sctp_sendmsg" made sctp_primitive_SEND return err only when asoc state is unavailable. In this case, chunks are not enqueued, they have no chance to be freed if we don't take care of them later. This Patch is actually to revert commit 1cd4d5c4326a ("sctp: remove the unused sctp_datamsg_free()"), commit 69b5777f2e57 ("sctp: hold the chunks only after the chunk is enqueued in outq") and commit 8b570dc9f7b6 ("sctp: only drop the reference on the datamsg after sending a msg"), to use sctp_datamsg_free to free the chunks of current msg. Fixes: 8b570dc9f7b6 ("sctp: only drop the reference on the datamsg after sending a msg") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: do not return the transmit err back to sctp_sendmsgXin Long2016-09-191-11/+5Star
| | | | | | | | | | | | | | | | | | | | | Once a chunk is enqueued successfully, sctp queues can take care of it. Even if it is failed to transmit (like because of nomem), it should be put into retransmit queue. If sctp report this error to users, it confuses them, they may resend that msg, but actually in kernel sctp stack is in charge of retransmit it already. Besides, this error probably is not from the failure of transmitting current msg, but transmitting or retransmitting another msg's chunks, as sctp_outq_flush just tries to send out all transports' chunks. This patch is to make sctp_cmd_send_msg return avoid, and not return the transmit err back to sctp_sendmsg Fixes: 8b570dc9f7b6 ("sctp: only drop the reference on the datamsg after sending a msg") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: remove the unnecessary state check in sctp_outq_tailXin Long2016-09-191-39/+14Star
| | | | | | | | | | | | | | | | Data Chunks are only sent by sctp_primitive_SEND, in which sctp checks the asoc's state through statetable before calling sctp_outq_tail. So there's no need to check the asoc's state again in sctp_outq_tail. Besides, sctp_do_sm is protected by lock_sock, even if sending msg is interrupted by timer events, the event's processes still need to acquire lock_sock first. It means no others CMDs can be enqueue into side effect list before CMD_SEND_MSG to change asoc->state, so it's safe to remove it. This patch is to remove redundant asoc->state check from sctp_outq_tail. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip6_tunnel: add collect_md mode to IPv6 tunnelsAlexei Starovoitov2016-09-171-45/+133
| | | | | | | | | | | | | Similar to gre, vxlan, geneve tunnels allow IPIP6 and IP6IP6 tunnels to operate in 'collect metadata' mode. Unlike ipv4 code here it's possible to reuse ip6_tnl_xmit() function for both collect_md and traditional tunnels. bpf_skb_[gs]et_tunnel_key() helpers and ovs (in the future) are the users. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip_tunnel: add collect_md mode to IPIP tunnelAlexei Starovoitov2016-09-172-6/+105
| | | | | | | | | | | | | | | Similar to gre, vxlan, geneve tunnels allow IPIP tunnels to operate in 'collect metadata' mode. bpf_skb_[gs]et_tunnel_key() helpers can make use of it right away. ovs can use it as well in the future (once appropriate ovs-vport abstractions and user apis are added). Note that just like in other tunnels we cannot cache the dst, since tunnel_info metadata can be different for every packet. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* l2tp: constify net_device_ops structuresJulia Lawall2016-09-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Check for net_device_ops structures that are only stored in the netdev_ops field of a net_device structure. This field is declared const, so net_device_ops structures that have this property can be declared as const also. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r disable optional_qualifier@ identifier i; position p; @@ static struct net_device_ops i@p = { ... }; @ok@ identifier r.i; struct net_device e; position p; @@ e.netdev_ops = &i@p; @bad@ position p != {r.p,ok.p}; identifier r.i; struct net_device_ops e; @@ e@i@p @depends on !bad disable optional_qualifier@ identifier r.i; @@ static +const struct net_device_ops i = { ... }; // </smpl> The result of size on this file before the change is: text data bss dec hex filename 3401 931 44 4376 1118 net/l2tp/l2tp_eth.o and after the change it is: text data bss dec hex filename 3993 347 44 4384 1120 net/l2tp/l2tp_eth.o Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
* llc: switch type to bool as the timeout is only tested versus 0Alan Cox2016-09-171-2/+2
| | | | | | | (As asked by Dave in Februrary) Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: prepare skbs for better sack shiftingEric Dumazet2016-09-171-7/+24
| | | | | | | | | | | | | | | | | | | | | | | | With large BDP TCP flows and lossy networks, it is very important to keep a low number of skbs in the write queue. RACK and SACK processing can perform a linear scan of it. We should avoid putting any payload in skb->head, so that SACK shifting can be done if needed. With this patch, we allow to pack ~0.5 MB per skb instead of the 64KB initially cooked at tcp_sendmsg() time. This gives a reduction of number of skbs in write queue by eight. tcp_rack_detect_loss() likes this. We still allow payload in skb->head for first skb put in the queue, to not impact RPC workloads. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rxrpc: Make IPv6 support conditional on CONFIG_IPV6David Howells2016-09-178-2/+34
| | | | | | | | | | | | | | Add CONFIG_AF_RXRPC_IPV6 and make the IPv6 support code conditional on it. This is then made conditional on CONFIG_IPV6. Without this, the following can be seen: net/built-in.o: In function `rxrpc_init_peer': >> peer_object.c:(.text+0x18c3c8): undefined reference to `ip6_route_output_flags' Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net-next: dsa: add Qualcomm tag RX/TX handlerJohn Crispin2016-09-165-0/+147
| | | | | | | | | | | | Add support for the 2-bytes Qualcomm tag that gigabit switches such as the QCA8337/N might insert when receiving packets, or that we need to insert while targeting specific switch ports. The tag is inserted directly behind the ethernet header. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: John Crispin <john@phrozen.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fix a stale ooo_last_skb after a replaceEric Dumazet2016-09-161-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When skb replaces another one in ooo queue, I forgot to also update tp->ooo_last_skb as well, if the replaced skb was the last one in the queue. To fix this, we simply can re-use the code that runs after an insertion, trying to merge skbs at the right of current skb. This not only fixes the bug, but also remove all small skbs that might be a subset of the new one. Example: We receive segments 2001:3001, 4001:5001 Then we receive 2001:8001 : We should replace 2001:3001 with the big skb, but also remove 4001:50001 from the queue to save space. packetdrill test demonstrating the bug 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> +0.100 < . 1:1(0) ack 1 win 1024 +0 accept(3, ..., ...) = 4 +0.01 < . 1001:2001(1000) ack 1 win 1024 +0 > . 1:1(0) ack 1 <nop,nop, sack 1001:2001> +0.01 < . 1001:3001(2000) ack 1 win 1024 +0 > . 1:1(0) ack 1 <nop,nop, sack 1001:2001 1001:3001> Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Cc: Yaogong Wang <wygivan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge tag 'rxrpc-rewrite-20160913-2' of ↵David S. Miller2016-09-168-119/+179
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Support IPv6 Here is a set of patches that add IPv6 support. They need to be applied on top of the just-posted miscellaneous fix patches. They are: (1) Make autobinding of an unconnected socket work when sendmsg() is called to initiate a client call. (2) Don't specify the protocol when creating the client socket, but rather take the default instead. (3) Use rxrpc_extract_addr_from_skb() in a couple of places that were doing the same thing manually. This allows the IPv6 address extraction to be done in fewer places. (4) Add IPv6 support. With this, calls can be made to IPv6 servers from userspace AF_RXRPC programs; AFS, however, can't use IPv6 yet as the RPC calls need to be upgradeable. ==================== Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * rxrpc: Add IPv6 supportDavid Howells2016-09-147-83/+154
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add IPv6 support to AF_RXRPC. With this, AF_RXRPC sockets can be created: service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET6); instead of: service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); The AFS filesystem doesn't support IPv6 at the moment, though, since that requires upgrades to some of the RPC calls. Note that a good portion of this patch is replacing "%pI4:%u" in print statements with "%pISpc" which is able to handle both protocols and print the port. Signed-off-by: David Howells <dhowells@redhat.com>
| * rxrpc: Use rxrpc_extract_addr_from_skb() rather than doing this manuallyDavid Howells2016-09-142-34/+11Star
| | | | | | | | | | | | | | | | | | There are two places that want to transmit a packet in response to one just received and manually pick the address to reply to out of the sk_buff. Make them use rxrpc_extract_addr_from_skb() instead so that IPv6 is handled automatically. Signed-off-by: David Howells <dhowells@redhat.com>
| * rxrpc: Don't specify protocol to when creating transport socketDavid Howells2016-09-141-2/+2
| | | | | | | | | | | | | | Pass 0 as the protocol argument when creating the transport socket rather than IPPROTO_UDP. Signed-off-by: David Howells <dhowells@redhat.com>
| * rxrpc: Create an address for sendmsg() to bind unbound socket withDavid Howells2016-09-141-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | Create an address for sendmsg() to bind unbound socket with rather than using a completely blank address otherwise the transport socket creation will fail because it will try to use address family 0. We use the address family specified in the protocol argument when the AF_RXRPC socket was created and SOCK_DGRAM as the default. For anything else, bind() must be used. Signed-off-by: David Howells <dhowells@redhat.com>
* | Merge tag 'rxrpc-rewrite-20160913-1' of ↵David S. Miller2016-09-1611-31/+64
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Miscellaneous fixes Here's a set of miscellaneous fix patches. There are a couple of points of note: (1) There is one non-fix patch that adjusts the call ref tracking tracepoint to make kernel API-held refs on calls more obvious. This is a prerequisite for the patch that fixes prealloc refcounting. (2) The final patch alters how jumbo packets that partially exceed the receive window are handled. Previously, space was being left in the Rx buffer for them, but this significantly hurts performance as the Rx window can't be increased to match the OpenAFS Tx window size. Instead, the excess subpackets are discarded and an EXCEEDS_WINDOW ACK is generated for the first. To avoid the problem of someone trying to run the kernel out of space by feeding the kernel a series of overlapping maximal jumbo packets, we stop allowing jumbo packets on a call if we encounter more than three jumbo packets with duplicate or excessive subpackets. ==================== Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * rxrpc: Correctly initialise, limit and transmit call->rx_winsizeDavid Howells2016-09-136-13/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | call->rx_winsize should be initialised to the sysctl setting and the sysctl setting should be limited to the maximum we want to permit. Further, we need to place this in the ACK info instead of the sysctl setting. Furthermore, discard the idea of accepting the subpackets of a jumbo packet that lie beyond the receive window when the first packet of the jumbo is within the window. Just discard the excess subpackets instead. This allows the receive window to be opened up right to the buffer size less one for the dead slot. Signed-off-by: David Howells <dhowells@redhat.com>
| * rxrpc: Fix prealloc refcountingDavid Howells2016-09-133-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The preallocated call buffer holds a ref on the calls within that buffer. The ref was being released in the wrong place - it worked okay for incoming calls to the AFS cache manager service, but doesn't work right for incoming calls to a userspace service. Instead of releasing an extra ref service calls in rxrpc_release_call(), the ref needs to be released during the acceptance/rejectance process. To this end: (1) The prealloc ref is now normally released during rxrpc_new_incoming_call(). (2) For preallocated kernel API calls, the kernel API's ref needs to be released when the call is discarded on socket close. (3) We shouldn't take a second ref in rxrpc_accept_call(). (4) rxrpc_recvmsg_new_call() needs to get a ref of its own when it adds the call to the to_be_accepted socket queue. In doing (4) above, we would prefer not to put the call's refcount down to 0 as that entails doing cleanup in softirq context, but it's unlikely as there are several refs held elsewhere, at least one of which must be put by someone in process context calling rxrpc_release_call(). However, it's not a problem if we do have to do that. Signed-off-by: David Howells <dhowells@redhat.com>
| * rxrpc: Adjust the call ref tracepoint to show kernel API refsDavid Howells2016-09-134-2/+7
| | | | | | | | | | | | | | | | | | | | | | Adjust the call ref tracepoint to show references held on a call by the kernel API separately as much as possible and add an additional trace to at the allocation point from the preallocation buffer for an incoming call. Note that this doesn't show the allocation of a client call for the kernel separately at the moment. Signed-off-by: David Howells <dhowells@redhat.com>
| * rxrpc: Allow tx_winsize to grow in response to an ACKDavid Howells2016-09-131-3/+5
| | | | | | | | | | | | | | Allow tx_winsize to grow when the ACK info packet shows a larger receive window at the other end rather than only permitting it to shrink. Signed-off-by: David Howells <dhowells@redhat.com>
| * rxrpc: Use skb->len not skb->data_lenDavid Howells2016-09-131-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | skb->len should be used rather than skb->data_len when referring to the amount of data in a packet. This will only cause a malfunction in the following cases: (1) We receive a jumbo packet (validation and splitting both are wrong). (2) We see if there's extra ACK info in an ACK packet (we think it's not there and just ignore it). Signed-off-by: David Howells <dhowells@redhat.com>
| * rxrpc: Add missing unlock in rxrpc_call_accept()David Howells2016-09-131-3/+5
| | | | | | | | | | | | | | Add a missing unlock in rxrpc_call_accept() in the path taken if there's no call to wake up. Signed-off-by: David Howells <dhowells@redhat.com>
| * rxrpc: Requeue call for recvmsg if more dataDavid Howells2016-09-131-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | rxrpc_recvmsg() needs to make sure that the call it has just been processing gets requeued for further attention if the buffer has been filled and there's more data to be consumed. The softirq producer only queues the call and wakes the socket if it fills the first slot in the window, so userspace might end up sleeping forever otherwise, despite there being data available. This is not a problem provided the userspace buffer is big enough or it empties the buffer completely before more data comes in. Signed-off-by: David Howells <dhowells@redhat.com>
| * rxrpc: The IDLE ACK packet should use rxrpc_idle_ack_delayDavid Howells2016-09-131-1/+1
| | | | | | | | | | | | | | The IDLE ACK packet should use the rxrpc_idle_ack_delay setting when the timer is set for it. Signed-off-by: David Howells <dhowells@redhat.com>
| * rxrpc: Add missing wakeup on Tx window rotationDavid Howells2016-09-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | We need to wake up the sender when Tx window rotation due to an incoming ACK makes space in the buffer otherwise the sender is liable to just hang endlessly. This problem isn't noticeable if the Tx phase transfers no more than will fit in a single window or the Tx window rotates fast enough that it doesn't get full. Signed-off-by: David Howells <dhowells@redhat.com>
| * rxrpc: Make sure we initialise the peer hash keyDavid Howells2016-09-131-1/+1
| | | | | | | | | | | | | | | | | | Peer records created for incoming connections weren't getting their hash key set. This meant that incoming calls wouldn't see more than one DATA packet - which is not a problem for AFS CM calls with small request data blobs. Signed-off-by: David Howells <dhowells@redhat.com>
* | openvswitch: avoid deferred execution of recirc actionsLance Richardson2016-09-161-2/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ovs kernel data path currently defers the execution of all recirc actions until stack utilization is at a minimum. This is too limiting for some packet forwarding scenarios due to the small size of the deferred action FIFO (10 entries). For example, broadcast traffic sent out more than 10 ports with recirculation results in packet drops when the deferred action FIFO becomes full, as reported here: http://openvswitch.org/pipermail/dev/2016-March/067672.html Since the current recursion depth is available (it is already tracked by the exec_actions_level pcpu variable), we can use it to determine whether to execute recirculation actions immediately (safe when recursion depth is low) or defer execution until more stack space is available. With this change, the deferred action fifo size becomes a non-issue for currently failing scenarios because it is no longer used when there are three or fewer recursions through ovs_execute_actions(). Suggested-by: Pravin Shelar <pshelar@ovn.org> Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/sched: cls_flower: Remove an unused field from the filter key structureOr Gerlitz2016-09-161-1/+0Star
| | | | | | | | | | | | | | | | | | | | | | Commit c3f8324188fa "net: Add full IPv6 addresses to flow_keys" added an unused instance of struct flow_dissector_key_addrs into struct fl_flow_key, remove it. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reported-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/sched: cls_flower: Support masking for matching on tcp/udp portsOr Gerlitz2016-09-161-8/+12
| | | | | | | | | | | | | | | | | | | | Add the definitions for src/dst udp/tcp port masks and use them when setting && dumping the relevant keys. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Paul Blakey <paulb@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: Introduce skbmod actionJamal Hadi Salim2016-09-163-0/+313
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This action is intended to be an upgrade from a usability perspective from pedit (as well as operational debugability). Compare this: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action pedit munge offset -14 u8 set 0x02 \ munge offset -13 u8 set 0x15 \ munge offset -12 u8 set 0x15 \ munge offset -11 u8 set 0x15 \ munge offset -10 u16 set 0x1515 \ pipe to: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action skbmod dmac 02:15:15:15:15:15 Also try to do a MAC address swap with pedit or worse try to debug a policy with destination mac, source mac and etherype. Then make few rules out of those and you'll get my point. In the future common use cases on pedit can be migrated to this action (as an example different fields in ip v4/6, transports like tcp/udp/sctp etc). For this first cut, this allows modifying basic ethernet header. The most important ethernet use case at the moment is when redirecting or mirroring packets to a remote machine. The dst mac address needs a re-write so that it doesnt get dropped or confuse an interconnecting (learning) switch or dropped by a target machine (which looks at the dst mac). And at times when flipping back the packet a swap of the MAC addresses is needed. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | bpf: use skb_at_tc_ingress helper in tcf_bpfDaniel Borkmann2016-09-161-1/+1
| | | | | | | | | | | | | | | | | | We have a small skb_at_tc_ingress() helper for testing for ingress, so make use of it. cls_bpf already uses it and so should act_bpf. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | bpf: drop unnecessary test in cls_bpf_classify and tcf_bpfDaniel Borkmann2016-09-162-6/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The skb_mac_header_was_set() test in cls_bpf's and act_bpf's fast-path is actually unnecessary and can be removed altogether. This was added by commit a166151cbe33 ("bpf: fix bpf helpers to use skb->mac_header relative offsets"), which was later on improved by 3431205e0397 ("bpf: make programs see skb->data == L2 for ingress and egress"). We're always guaranteed to have valid mac header at the time we invoke cls_bpf_classify() or tcf_bpf(). Reason is that since 6d1ccff62780 ("net: reset mac header in dev_start_xmit()") we do skb_reset_mac_header() in __dev_queue_xmit() before we could call into sch_handle_egress() or any subsequent enqueue. sch_handle_ingress() always sees a valid mac header as well (things like skb_reset_mac_len() would badly fail otherwise). Thus, drop the unnecessary test in classifier and action case. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/sched: act_tunnel_key: Remove rcu_read_lock protectionHadar Hen Zion2016-09-161-13/+4Star
|/ | | | | | | | | | | | | | | | | Remove rcu_read_lock protection from tunnel_key_dump and use rtnl_dereference, dump operation is protected by rtnl lock. Also, remove rcu_read_lock from tunnel_key_release and use rcu_dereference_protected. Both operations are running exclusively and a writer couldn't modify t->params while those functions are executed. Fixes: 54d94fd89d90 ('net/sched: Introduce act_tunnel_key') Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tipc: fix possible memory leak in tipc_udp_enable()Wei Yongjun2016-09-131-1/+2
| | | | | | | | | | 'ub' is malloced in tipc_udp_enable() and should be freed before leaving from the error handling cases, otherwise it will cause memory leak. Fixes: ba5aa84a2d22 ("tipc: split UDP nl address parsing") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: bridge: add helper to call /sbin/bridge-stpVivien Didelot2016-09-131-12/+31
| | | | | | | | | | | | | | | | | | | | | | | | | If /sbin/bridge-stp is available on the system, bridge tries to execute it instead of the kernel implementation when starting/stopping STP. If anything goes wrong with /sbin/bridge-stp, bridge silently falls back to kernel STP, making hard to debug userspace STP. This patch adds a br_stp_call_user helper to start/stop userspace STP and debug errors from the program: abnormal exit status is stored in the lower byte and normal exit status is stored in higher byte. Below is a simple example on a kernel with dynamic debug enabled: # ln -s /bin/false /sbin/bridge-stp # brctl stp br0 on br0: failed to start userspace STP (256) # dmesg br0: /sbin/bridge-stp exited with code 1 br0: failed to start userspace STP (256) br0: using kernel STP Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-09-1336-162/+235
|\ | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/mediatek/mtk_eth_soc.c drivers/net/ethernet/qlogic/qed/qed_dcbx.c drivers/net/phy/Kconfig All conflicts were cases of overlapping commits. Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2016-09-1235-160/+233
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: "Mostly small sets of driver fixes scattered all over the place. 1) Mediatek driver fixes from Sean Wang. Forward port not written correctly during TX map, missed handling of EPROBE_DEFER, and mistaken use of put_page() instead of skb_free_frag(). 2) Fix socket double-free in KCM code, from WANG Cong. 3) QED driver fixes from Sudarsana Reddy Kalluru, including a fix for using the dcbx buffers before initializing them. 4) Mellanox Switch driver fixes from Jiri Pirko, including a fix for double fib removals and an error handling fix in mlxsw_sp_module_init(). 5) Fix kernel panic when enabling LLDP in i40e driver, from Dave Ertman. 6) Fix padding of TSO packets in thunderx driver, from Sunil Goutham. 7) TCP's rcv_wup not initialized properly when using fastopen, from Neal Cardwell. 8) Don't use uninitialized flow keys in flow dissector, from Gao Feng. 9) Use after free in l2tp module unload, from Sabrina Dubroca. 10) Fix interrupt registry ordering issues in smsc911x driver, from Jeremy Linton. 11) Fix crashes in bonding having to do with enslaving and rx_handler, from Mahesh Bandewar. 12) AF_UNIX deadlock fixes from Linus. 13) In mlx5 driver, don't read skb->xmit_mode after it might have been freed from the TX reclaim path. From Tariq Toukan. 14) Fix a bug from 2015 in TCP Yeah where the congestion window does not increase, from Artem Germanov. 15) Don't pad frames on receive in NFP driver, from Jakub Kicinski. 16) Fix chunk fragmenting in SCTP wrt. GSO, from Marcelo Ricardo Leitner. 17) Fix deletion of VRF routes, from Mark Tomlinson. 18) Fix device refcount leak when DAD fails in ipv6, from Wei Yongjun" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (101 commits) net/mlx4_en: Fix panic on xmit while port is down net/mlx4_en: Fixes for DCBX net/mlx4_en: Fix the return value of mlx4_en_dcbnl_set_state() net/mlx4_en: Fix the return value of mlx4_en_dcbnl_set_all() net: ethernet: renesas: sh_eth: add POST registers for rz drivers: net: phy: mdio-xgene: Add hardware dependency dwc_eth_qos: do not register semi-initialized device sctp: identify chunks that need to be fragmented at IP level mlxsw: spectrum: Set port type before setting its address mlxsw: spectrum_router: Fix error path in mlxsw_sp_router_init nfp: don't pad frames on receive nfp: drop support for old firmware ABIs nfp: remove linux/version.h includes tcp: cwnd does not increase in TCP YeAH net/mlx5e: Fix parsing of vlan packets when updating lro header net/mlx5e: Fix global PFC counters replication net/mlx5e: Prevent casting overflow net/mlx5e: Move an_disable_cap bit to a new position net/mlx5e: Fix xmit_more counter race issue tcp: fastopen: avoid negative sk_forward_alloc ...
| | * sctp: identify chunks that need to be fragmented at IP levelMarcelo Ricardo Leitner2016-09-101-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, without GSO, it was easy to identify it: if the chunk didn't fit and there was no data chunk in the packet yet, we could fragment at IP level. So if there was an auth chunk and we were bundling a big data chunk, it would fragment regardless of the size of the auth chunk. This also works for the context of PMTU reductions. But with GSO, we cannot distinguish such PMTU events anymore, as the packet is allowed to exceed PMTU. So we need another check: to ensure that the chunk that we are adding, actually fits the current PMTU. If it doesn't, trigger a flush and let it be fragmented at IP level in the next round. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * tcp: cwnd does not increase in TCP YeAHArtem Germanov2016-09-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 76174004a0f19785a328f40388e87e982bbf69b9 (tcp: do not slow start when cwnd equals ssthresh ) introduced regression in TCP YeAH. Using 100ms delay 1% loss virtual ethernet link kernel 4.2 shows bandwidth ~500KB/s for single TCP connection and kernel 4.3 and above (including 4.8-rc4) shows bandwidth ~100KB/s. That is caused by stalled cwnd when cwnd equals ssthresh. This patch fixes it by proper increasing cwnd in this case. Signed-off-by: Artem Germanov <agermanov@anchorfree.com> Acked-by: Dmitry Adamushko <d.adamushko@anchorfree.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * tcp: fastopen: avoid negative sk_forward_allocEric Dumazet2016-09-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When DATA and/or FIN are carried in a SYN/ACK message or SYN message, we append an skb in socket receive queue, but we forget to call sk_forced_mem_schedule(). Effect is that the socket has a negative sk->sk_forward_alloc as long as the message is not read by the application. Josh Hunt fixed a similar issue in commit d22e15371811 ("tcp: fix tcp fin memory accounting") Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Josh Hunt <johunt@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * Merge branch 'master' of ↵David S. Miller2016-09-086-18/+18
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== ipsec 2016-09-08 1) Fix a crash when xfrm_dump_sa returns an error. From Vegard Nossum. 2) Remove some incorrect WARN() on normal error handling. From Vegard Nossum. 3) Ignore socket policies when rebuilding hash tables, socket policies are not inserted into the hash tables. From Tobias Brunner. 4) Initialize and check tunnel pointers properly before we use it. From Alexey Kodanev. 5) Fix l3mdev oif setting on xfrm dst lookups. From David Ahern. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * xfrm: Only add l3mdev oif to dst lookupsDavid Ahern2016-08-222-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Subash reported that commit 42a7b32b73d6 ("xfrm: Add oif to dst lookups") broke a wifi use case that uses fib rules and xfrms. The intent of 42a7b32b73d6 was driven by VRFs with IPsec. As a compromise relax the use of oif in xfrm lookups to L3 master devices only (ie., oif is either an L3 master device or is enslaved to a master device). Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups") Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | | * net/xfrm_input: fix possible NULL deref of tunnel.ip6->parms.i_keyAlexey Kodanev2016-08-112-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Running LTP 'icmp-uni-basic.sh -6 -p ipcomp -m tunnel' test over openvswitch + veth can trigger kernel panic: BUG: unable to handle kernel NULL pointer dereference at 00000000000000e0 IP: [<ffffffff8169d1d2>] xfrm_input+0x82/0x750 ... [<ffffffff816d472e>] xfrm6_rcv_spi+0x1e/0x20 [<ffffffffa082c3c2>] xfrm6_tunnel_rcv+0x42/0x50 [xfrm6_tunnel] [<ffffffffa082727e>] tunnel6_rcv+0x3e/0x8c [tunnel6] [<ffffffff8169f365>] ip6_input_finish+0xd5/0x430 [<ffffffff8169fc53>] ip6_input+0x33/0x90 [<ffffffff8169f1d5>] ip6_rcv_finish+0xa5/0xb0 ... It seems that tunnel.ip6 can have garbage values and also dereferenced without a proper check, only tunnel.ip4 is being verified. Fix it by adding one more if block for AF_INET6 and initialize tunnel.ip6 with NULL inside xfrm6_rcv_spi() (which is similar to xfrm4_rcv_spi()). Fixes: 049f8e2 ("xfrm: Override skb->mark with tunnel->parm.i_key in xfrm_input") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | | * xfrm: Ignore socket policies when rebuilding hash tablesTobias Brunner2016-07-291-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever thresholds are changed the hash tables are rebuilt. This is done by enumerating all policies and hashing and inserting them into the right table according to the thresholds and direction. Because socket policies are also contained in net->xfrm.policy_all but no hash tables are defined for their direction (dir + XFRM_POLICY_MAX) this causes a NULL or invalid pointer dereference after returning from policy_hash_bysel() if the rebuild is done while any socket policies are installed. Since the rebuild after changing thresholds is scheduled this crash could even occur if the userland sets thresholds seemingly before installing any socket policies. Fixes: 53c2e285f970 ("xfrm: Do not hash socket policies") Signed-off-by: Tobias Brunner <tobias@strongswan.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | | * xfrm: get rid of another incorrect WARNVegard Nossum2016-07-271-3/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During fuzzing I regularly run into this WARN(). According to Herbert Xu, this "certainly shouldn't be a WARN, it probably shouldn't print anything either". Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | | * xfrm: get rid of incorrect WARNVegard Nossum2016-07-271-3/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AFAICT this message is just printed whenever input validation fails. This is a normal failure and we shouldn't be dumping the stack over it. Looks like it was originally a printk that was maybe incorrectly upgraded to a WARN: commit 62db5cfd70b1ef53aa21f144a806fe3b78c84fab Author: stephen hemminger <shemminger@vyatta.com> Date: Wed May 12 06:37:06 2010 +0000 xfrm: add severity to printk Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>