summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* batman-adv: add isolation_mark sysfs attributeAntonio Quartulli2014-01-084-0/+83
| | | | | | | | | | | | | | | | | | | This attribute can be used to set and read the value and the mask of the skb mark which will be used to classify the source non-mesh client as ISOLATED. In this way a client can be advertised as such and the mark can potentially be restored at the receiving node before delivering the skb. This can be helpful for creating network wide netfilter policies. This sysfs file expects a string of the shape "$mark/$mask". Where $mark has to be a 32-bit number in any base, while $mask must be a 32bit mask expressed in hex base. Only bits in $mark covered by the bitmask are really stored. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
* batman-adv: send every DHCP packet as bat-unicastAntonio Quartulli2014-01-087-140/+153
| | | | | | | | | | | | | | | | | | | In different situations it is possible that the DHCP server or client uses broadcast Ethernet frames to send messages to each other. The GW component in batman-adv takes care of using bat-unicast packets to bring broadcast DHCP Discover/Requests to the "best" server. On the way back the DHCP server usually sends unicasts, but upon client request it may decide to use broadcasts as well. This patch improves the GW component so that it now snoops and sends as unicast all the DHCP packets, no matter if they were generated by a DHCP server or client. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
* batman-adv: remove parenthesis from return statementsAntonio Quartulli2014-01-081-1/+1
| | | | | | | | Remove parenthesis around return expression as suggested by checkpatch. Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
* batman-adv: rename gw_deselect() to gw_reselect()Antonio Quartulli2014-01-084-18/+29
| | | | | | | | | | | | | | The function batadv_gw_deselect() is actually not deselecting anything. It is just informing the GW code to perform a re-election procedure when possible. The current gateway is not being touched at all and therefore the name of this function is rather misleading. Rename it to batadv_gw_reselect() to batadv_gw_reselect() to make its behaviour easier to grasp. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
* batman-adv: deselect current GW on client mode switch offAntonio Quartulli2014-01-082-0/+14
| | | | | | | | | | | | | | | | When switching from gw_mode client to either off or server the current selected gateway has to be deselected. In this way when client mode is enabled again a gateway re-election is forced and a GW_ADD event is consequently sent. The current behaviour instead is to keep the current gateway leading to no GW_ADD event when gw_mode client is selected for a second time Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
* batman-adv: remove FSF address from GPL disclaimerAntonio Quartulli2014-01-0841-123/+41Star
| | | | | | | | | | | | As suggested by checkpatch, remove all the references to the FSF address since the kernel already has one reference in its documentation. In this way it is easier to update it in case of future changes. Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
* batman-adv: don't switch byte order too often if not neededAntonio Quartulli2014-01-081-3/+5
| | | | | | | | | | If possible, operations like ntohs/ntohl should not be performed too often. Use a variable to locally store the converted value and then use it. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
* batman-adv: properly rename define in distributed arp table header fileAntonio Quartulli2014-01-081-3/+3
| | | | | Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
* Merge branch 'master' of ↵David S. Miller2014-01-0810-156/+397
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next Jeff Kirsher says: ==================== Intel Wired LAN Driver Updates This series contains updates to i40e only. Anjali adds more functionality to debugfs to assist development and testing of admin queue commands. Greg makes sure broadcast promiscuous is disabled by default, otherwise VLAN tagged packets out of the assigned VLAN domain are received. Also provides a fix when the 8021q driver is loaded, so that VLAN 0 tagged packets are accepted so that upper layers can interpret the priority bits. Then provides a fix to let the VF to request the PF to set its already assigned MAC address without generating an error. Greg also adds helper functions to enable or disable internal switch loopback when VFs are created or destroyed via the sysfs interface. Shannon provides most of the changes, where he adds code to ensure that the hardware waits to make sure that the firmware is ready as well after reset. Also updates the code to use the new features in the firmware. Provides a fix while in MFP mode where resources are reduced, so use a smaller range of test registers than when in SFP mode. Moves the PF ID initialization code to earlier in the driver initialization function since a few operations need the information before the first PF reset is called. Shannon adds a check for MAC type before reading anything from the registers to ensure we dealing with the correct MAC type. Then reworks Shadow RAM read word/buffer functions as to not block whole NVM resources for SR read operations. Mitch lastly provides a fix to correctly setup ARQ descriptors in the cleanup code. Catherine bumps the driver version due to all the recent changes. v2: - Added blank lines after local variable declarations to patch 01 as suggested by David Miller - Used the suggested memset() line in patch 15 as suggested by David Miller ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * i40e: correctly setup ARQ descriptorsMitch Williams2014-01-081-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | When cleaning descriptors, we must set up ALL fields, not just the DMA address. The initial setup does this correctly, but not the cleanup code, so the firmware would process the ring exactly once and then fail. Change-ID: I2930b83c76194b3016a8ac0fa693f9a573995640 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: remove redundant AQ enableKamil Krawczyk2014-01-081-6/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | The admin queue length register is updated in config_a<sq|rq>_regs functions. We should not update it again, as we will trigger firmware to init the AQ again. In this case firmware will lose the information about the AQ Rx tail position and will see Rx queue as full (no free desc for FW to use). Signed-off-by: Kamil Krawczyk <kamil.krawczyk@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: Enable/Disable PF switch LB on SR-IOV configure changesGreg Rose2014-01-081-2/+74
| | | | | | | | | | | | | | | | | | | | The PF VSI was never updated to enable or disable internal switch loopback when VFs were created or destroyed via the sysfs interface. Add some helper functions to take care of that. Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Tested-by: Sibai Li <sibai.li@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: whitespace paren and comment tweaksShannon Nelson2014-01-083-6/+7
| | | | | | | | | | | | | | | | | | Addresses a few code format issues that have crept in over time. Change-Id: I1a62cbd16b29a218a933b0f7176abe748f9615e8 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: rework shadow ram read functionsShannon Nelson2014-01-081-51/+16Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Rework Shadow RAM read word/buffer functions to not use AQ Request Resource commands. Requesting resource is not needed for SR read operations which are done through SRCTL register. Access to SR through register is controlled through DONE bit within SRCTL. With this change we do not block whole NVM resource for SR read operations. Change-Id: I73e96cdea39a45ee7b5bdf038e527308de2d9efe Signed-off-by: Kamil Krawczyk <kamil.krawczyk@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: check MAC type before any REG accessShannon Nelson2014-01-081-8/+8
| | | | | | | | | | | | | | | | | | | | | | We need to check if we are dealing with the correct MAC type before we try to read anything from the registers. Change-Id: I3989516999d06c3009e87d6a2eafc20af305c5c2 Signed-off-by: Kamil Krawczyk <kamil.krawczyk@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: move PF ID init from PF reset to SC initShannon Nelson2014-01-081-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | Move PF ID initialization code from PF reset routine to earlier driver initialization function. There are a few operations which need the PF ID before the first PF reset is called. Change-Id: I7e971f7556b46a837149850ec05ce115c35db575 Signed-off-by: Kamil Krawczyk <kamil.krawczyk@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: Reduce range of interrupt reg in reg testShannon Nelson2014-01-081-3/+3
| | | | | | | | | | | | | | | | | | | | | | Use a smaller range of test registers in MFP mode as there are fewer resources than when in SFP mode. Change-Id: I08424890c3f57b5dde5ee99e99724ce252e0875a Signed-off-by: Kamil Krawczyk <kamil.krawczyk@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: update firmware api to 1.1Shannon Nelson2014-01-084-44/+99
| | | | | | | | | | | | | | | | | | | | The firmware's AdminQ interface has matured a little, so update the code to use the new fields and values. Change-Id: I8fcd7b443f268dcf9346bd6a9e940fe9c2958891 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: Add code to wait for FW to complete in reset pathShannon Nelson2014-01-082-0/+42
| | | | | | | | | | | | | | | | | | | | | | The RSTAT comes back with DEVICE READY to indicate that the HW is ready, but we still have to wait for the FW to indicate that the Core and Global modules are back up again. This needs a read of the NVM_ULD register. Change-Id: I88276165f9cd446d2f166fb4b8cff00521af4bec Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: Bump versionCatherine Sullivan2014-01-081-1/+1
| | | | | | | | | | | | | | | | Version update to 0.3.25-k Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: Allow VF to set already assigned MAC addressGreg Rose2014-01-081-1/+4
| | | | | | | | | | | | | | | | | | | | | | The VF is allowed to request the PF to set its already assigned MAC address without generating an error. Change-Id: I8dfdf353396995dbbb26cafab4e42b451911da3d Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Acked-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: Stop accepting any VLAN tag on VLAN 0 filter setGreg Rose2014-01-081-11/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | When the 8021q driver is loaded it sets VLAN 0 filters on all devices. This does not mean that any VLAN tagged packet should be accepted. Instead accept only VLAN 0 tagged packets so that upper layers can interpret the priority bits. Change-Id: I17274a540b613749612ffe23a3aef2b8ee6ff6a4 Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Sibai Li <sibai.li@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: Do not enable broadcast promiscuous by defaultGreg Rose2014-01-082-17/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Broadcast promiscuous should only be turned on when general promiscuous mode is turned on, otherwise VLAN tagged packets out of the assigned VLAN domain are received. Add a broadcast MAC filter in order to continue to receive broadcast traffic on VLANs, MAIN or VMDQ VSI. Change-Id: I99d8e382a082ee51201228f1226af3b46452ac55 Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Sibai Li <sibai.li@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: Expose AQ debugfs hooksAnjali Singhai Jain2014-01-081-0/+114
|/ | | | | | | | | | | | | | | Add more functionality to debugfs to assist development and testing of AQ commands. adds: send aq_cmd send indirect aq_cmd Change-Id: I01710cddd33110a6c1e1aabf84cb6e93cda4c42a Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
* net: xfrm: xfrm_policy: silence compiler warningYing Xue2014-01-081-0/+2
| | | | | | | | | Fix below compiler warning: net/xfrm/xfrm_policy.c:1644:12: warning: ‘xfrm_dst_alloc_copy’ defined but not used [-Wunused-function] Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'tipc'David S. Miller2014-01-088-64/+73
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jon Maloy says: ==================== tipc: link setup and failover improvements This series consists of four unrelated commits with different purposes. - Commit #1 is purely cosmetic and pedagogic, hopefully making the failover/tunneling logics slightly easier to understand. - Commit #2 fixes a bug that has always been in the code, but was not discovered until very recently. - Commit #3 fixes a non-fatal race issue in the neighbour discovery code. - Commit #4 removes an unnecessary indirection step during link startup. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: make link start event synchronousJon Paul Maloy2014-01-083-13/+7Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a link is created we delay the start event by launching it to be executed later in a tasklet. As we hold all the necessary locks at the moment of creation, and there is no risk of deadlock or contention, this delay serves no purpose in the current code. We remove this obsolete indirection step, and the associated function link_start(). At the same time, we rename the function tipc_link_stop() to the more appropriate tipc_link_purge_queues(). Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: introduce new spinlock to protect struct link_reqYing Xue2014-01-081-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, only 'bearer_lock' is used to protect struct link_req in the function disc_timeout(). This is unsafe, since the member fields 'num_nodes' and 'timer_intv' might be accessed by below three different threads simultaneously, none of them grabbing bearer_lock in the critical region: link_activate() tipc_bearer_add_dest() tipc_disc_add_dest() req->num_nodes++; tipc_link_reset() tipc_bearer_remove_dest() tipc_disc_remove_dest() req->num_nodes-- disc_update() read req->num_nodes write req->timer_intv disc_timeout() read req->num_nodes read/write req->timer_intv Without lock protection, the only symptom of a race is that discovery messages occasionally may not be sent out. This is not fatal, since such messages are best-effort anyway. On the other hand, since discovery messages are not time critical, adding a protecting lock brings no serious overhead either. So we add a new, dedicated spinlock in order to guarantee absolute data consistency in link_req objects. This also helps reduce the overall role of the bearer_lock, which we want to remove completely in a later commit series. Signed-off-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: remove 'has_redundant_link' flag from STATE link protocol messagesJon Paul Maloy2014-01-082-11/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The flag 'has_redundant_link' is defined only in RESET and ACTIVATE protocol messages. Due to an ambiguity in the protocol specification it is currently also transferred in STATE messages. Its value is used to initialize a link state variable, 'permit_changeover', which is used to inhibit futile link failover attempts when it is known that the peer node has no working links at the moment, although the local node may still think it has one. The fact that 'has_redundant_link' incorrectly is read from STATE messages has the effect that 'permit_changeover' sometimes gets a wrong value, and permanently blocks any links from being re-established. Such failures can only occur in in dual-link systems, and are extremely rare. This bug seems to have always been present in the code. Furthermore, since commit b4b5610223f17790419b03eaa962b0e3ecf930d7 ("tipc: Ensure both nodes recognize loss of contact between them"), the 'permit_changeover' field serves no purpose any more. The task of enforcing 'lost contact' cycles at both peer endpoints is now taken by a new mechanism, using the flags WAIT_NODE_DOWN and WAIT_PEER_DOWN in struct tipc_node to abort unnecessary failover attempts. We therefore remove the 'has_redundant_link' flag from STATE messages, as well as the now redundant 'permit_changeover' variable. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: rename functions related to link failover and improve commentsJon Paul Maloy2014-01-085-38/+56
|/ | | | | | | | | | | | | | | | The functionality related to link addition and failover is unnecessarily hard to understand and maintain. We try to improve this by renaming some of the functions, at the same time adding or improving the explanatory comments around them. Names such as "tipc_rcv()" etc. also align better with what is used in other networking components. The changes in this commit are purely cosmetic, no functional changes are made. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: skbuff: const-ify casts in skb_queue_* functionsDaniel Borkmann2014-01-081-3/+3
| | | | | | | | | | We should const-ify comparisons on skb_queue_* inline helper functions as their parameters are const as well, so lets not drop that. Suggested-by: Brad Spengler <spender@grsecurity.net> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: xfrm: xfrm_policy: fix inline not at beginning of declarationDaniel Borkmann2014-01-081-6/+6
| | | | | | | | | | | | | | Fix three warnings related to: net/xfrm/xfrm_policy.c:1644:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration] net/xfrm/xfrm_policy.c:1656:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration] net/xfrm/xfrm_policy.c:1668:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration] Just removing the inline keyword is sufficient as the compiler will decide on its own about inlining or not. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* mlx4_en: Select PTP_1588_CLOCKShawn Bohrer2014-01-071-0/+1
| | | | | | | | | | | | | | | Now that mlx4_en includes a PHC driver it must select PTP_1588_CLOCK. drivers/built-in.o: In function `mlx4_en_get_ts_info': >> en_ethtool.c:(.text+0x391a11): undefined reference to `ptp_clock_index' drivers/built-in.o: In function `mlx4_en_remove_timestamp': >> (.text+0x397913): undefined reference to `ptp_clock_unregister' drivers/built-in.o: In function `mlx4_en_init_timestamp': >> (.text+0x397b20): undefined reference to `ptp_clock_register' Fixes: ad7d4eaed995d ("mlx4_en: Add PTP hardware clock") Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net-gre-gro: Add GRE support to the GRO stackJerry Chu2014-01-076-7/+216
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch built on top of Commit 299603e8370a93dd5d8e8d800f0dff1ce2c53d36 ("net-gro: Prepare GRO stack for the upcoming tunneling support") to add the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO stack. It also serves as an example for supporting other encapsulation protocols in the GRO stack in the future. The patch supports version 0 and all the flags (key, csum, seq#) but will flush any pkt with the S (seq#) flag. This is because the S flag is not support by GSO, and a GRO pkt may end up in the forwarding path, thus requiring GSO support to break it up correctly. Currently the "packet_offload" structure only contains L3 (ETH_P_IP/ ETH_P_IPV6) GRO offload support so the encapped pkts are limited to IP pkts (i.e., w/o L2 hdr). But support for other protocol type can be easily added, so is the support for GRE variations like NVGRE. The patch also support csum offload. Specifically if the csum flag is on and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE), the code will take advantage of the csum computed by the h/w when validating the GRE csum. Note that commit 60769a5dcd8755715c7143b4571d5c44f01796f1 "ipv4: gre: add GRO capability" already introduces GRO capability to IPv4 GRE tunnels, using the gro_cells infrastructure. But GRO is done after GRE hdr has been removed (i.e., decapped). The following patch applies GRO when pkts first come in (before hitting the GRE tunnel code). There is some performance advantage for applying GRO as early as possible. Also this approach is transparent to other subsystem like Open vSwitch where GRE decap is handled outside of the IP stack hence making it harder for the gro_cells stuff to apply. On the other hand, some NICs are still not capable of hashing on the inner hdr of a GRE pkt (RSS). In that case the GRO processing of pkts from the same remote host will all happen on the same CPU and the performance may be suboptimal. I'm including some rough preliminary performance numbers below. Note that the performance will be highly dependent on traffic load, mix as usual. Moreover it also depends on NIC offload features hence the following is by no means a comprehesive study. Local testing and tuning will be needed to decide the best setting. All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs. (super_netperf 50 -H 192.168.1.18 -l 30) An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1 mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123) is configured. The GRO support for pkts AFTER decap are controlled through the device feature of the GRE device (e.g., ethtool -K gre1 gro on/off). 1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off thruput: 9.16Gbps CPU utilization: 19% 1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off thruput: 5.9Gbps CPU utilization: 15% 1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 12-13% 1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 10% The following tests were performed on a different NIC that is capable of csum offload. I.e., the h/w is capable of computing IP payload csum (CHECKSUM_COMPLETE). 2.1 ethtool -K gre1 gro on (hence will use gro_cells) 2.1.1 ethtool -K eth0 gro off; csum offload disabled thruput: 8.53Gbps CPU utilization: 9% 2.1.2 ethtool -K eth0 gro off; csum offload enabled thruput: 8.97Gbps CPU utilization: 7-8% 2.1.3 ethtool -K eth0 gro on; csum offload disabled thruput: 8.83Gbps CPU utilization: 5-6% 2.1.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.98Gbps CPU utilization: 5% 2.2 ethtool -K gre1 gro off 2.2.1 ethtool -K eth0 gro off; csum offload disabled thruput: 5.93Gbps CPU utilization: 9% 2.2.2 ethtool -K eth0 gro off; csum offload enabled thruput: 5.62Gbps CPU utilization: 8% 2.2.3 ethtool -K eth0 gro on; csum offload disabled thruput: 7.69Gbps CPU utilization: 8% 2.2.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.96Gbps CPU utilization: 5-6% Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Do not enable tx-nocache-copy by defaultBenjamin Poirier2014-01-071-5/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are many cases where this feature does not improve performance or even reduces it. For example, here are the results from tests that I've run using 3.12.6 on one Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are from the Xeon, but they're similar on the i7. All numbers report the mean±stddev over 10 runs of 10s. 1) latency tests similar to what is described in "c6e1a0d net: Allow no-cache copy from user on transmit" There is no statistically significant difference between tx-nocache-copy on/off. nic irqs spread out (one queue per cpu) 200x netperf -r 1400,1 tx-nocache-copy off 692000±1000 tps 50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3 tx-nocache-copy on 693000±1000 tps 50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7 200x netperf -r 14000,14000 tx-nocache-copy off 86450±80 tps 50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40 tx-nocache-copy on 86110±60 tps 50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20 2) single stream throughput tests tx-nocache-copy leads to higher service demand throughput cpu0 cpu1 demand (Gb/s) (Gcycle) (Gcycle) (cycle/B) nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send) tx-nocache-copy off 9402±5 9.4±0.2 0.80±0.01 tx-nocache-copy on 9403±3 9.85±0.04 0.838±0.004 nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send) tx-nocache-copy off 9401±5 5.83±0.03 5.0±0.1 0.923±0.007 tx-nocache-copy on 9404±2 5.74±0.03 5.523±0.009 0.958±0.002 As a second example, here are some results from Eric Dumazet with latest net-next. tx-nocache-copy also leads to higher service demand (cpu is Intel(R) Xeon(R) CPU X5660 @ 2.80GHz) lpq83:~# ./ethtool -K eth0 tx-nocache-copy on lpq83:~# perf stat ./netperf -H lpq84 -c MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB 87380 16384 16384 10.00 9407.44 2.50 -1.00 0.522 -1.000 Performance counter stats for './netperf -H lpq84 -c': 4282.648396 task-clock # 0.423 CPUs utilized 9,348 context-switches # 0.002 M/sec 88 CPU-migrations # 0.021 K/sec 355 page-faults # 0.083 K/sec 11,812,797,651 cycles # 2.758 GHz [82.79%] 9,020,522,817 stalled-cycles-frontend # 76.36% frontend cycles idle [82.54%] 4,579,889,681 stalled-cycles-backend # 38.77% backend cycles idle [67.33%] 6,053,172,792 instructions # 0.51 insns per cycle # 1.49 stalled cycles per insn [83.64%] 597,275,583 branches # 139.464 M/sec [83.70%] 8,960,541 branch-misses # 1.50% of all branches [83.65%] 10.128990264 seconds time elapsed lpq83:~# ./ethtool -K eth0 tx-nocache-copy off lpq83:~# perf stat ./netperf -H lpq84 -c MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB 87380 16384 16384 10.00 9412.45 2.15 -1.00 0.449 -1.000 Performance counter stats for './netperf -H lpq84 -c': 2847.375441 task-clock # 0.281 CPUs utilized 11,632 context-switches # 0.004 M/sec 49 CPU-migrations # 0.017 K/sec 354 page-faults # 0.124 K/sec 7,646,889,749 cycles # 2.686 GHz [83.34%] 6,115,050,032 stalled-cycles-frontend # 79.97% frontend cycles idle [83.31%] 1,726,460,071 stalled-cycles-backend # 22.58% backend cycles idle [66.55%] 2,079,702,453 instructions # 0.27 insns per cycle # 2.94 stalled cycles per insn [83.22%] 363,773,213 branches # 127.757 M/sec [83.29%] 4,242,732 branch-misses # 1.17% of all branches [83.51%] 10.128449949 seconds time elapsed CC: Tom Herbert <therbert@google.com> Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: loopback device: ignore value changes after device is uppedJiri Pirko2014-01-071-0/+2
| | | | | | | | | | | | | When lo is brought up, new ifa is created. Then, devconf and neigh values bitfield should be set so later changes of default values would not affect lo values. Note that the same behaviour is in ipv6. Also note that this is likely not an issue in many distros (for example Fedora 19) because userspace sets address to lo manually before bringing it up. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
* IPv6: add the option to use anycast addresses as source addresses in echo replyFX Le Bail2014-01-075-1/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | This change allows to follow a recommandation of RFC4942. - Add "anycast_src_echo_reply" sysctl to control the use of anycast addresses as source addresses for ICMPv6 echo reply. This sysctl is false by default to preserve existing behavior. - Add inline check ipv6_anycast_destination(). - Use them in icmpv6_echo_reply(). Reference: RFC4942 - IPv6 Transition/Coexistence Security Considerations (http://tools.ietf.org/html/rfc4942#section-2.1.6) 2.1.6. Anycast Traffic Identification and Security [...] To avoid exposing knowledge about the internal structure of the network, it is recommended that anycast servers now take advantage of the ability to return responses with the anycast address as the source address if possible. Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/mlx4_en: fix error return code in mlx4_en_get_qp()Wei Yongjun2014-01-071-2/+3
| | | | | | | | | Fix to return a negative error code from the error handling case instead of 0. Fixes: 837052d0ccc5 ('net/mlx4_en: Add netdev support for TCP/IP offloads of vxlan tunneling') Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
* r8152: correct some messagesHayes Wang2014-01-071-10/+13
| | | | | | | | | | - Replace pr_warn_ratelimited() with net_ratelimit() and netdev_warn(). - Adjust the algnment of some messages. - Remove the peroid. - Fix some messages don't have terminating newline. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bna: Fix build due to missing use of dma_unmap_len_set()David S. Miller2014-01-071-2/+2
| | | | | | | | | | | > as reported for linux-next of Dec.20, 2013 > when CONFIG_NEED_DMA_MAP_STATE is not enabled: > > drivers/net/ethernet/brocade/bna/bnad.c: In function 'bnad_start_xmit': > drivers/net/ethernet/brocade/bna/bnad.c:3074:26: error: 'struct bnad_tx_vector' has no member named 'dma_len' Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* gre_offload: statically build GRE offloading supportEric Dumazet2014-01-074-16/+7Star
| | | | | | | | | | | | GRO/GSO layers can be enabled on a node, even if said node is only forwarding packets. This patch permits GSO (and upcoming GRO) support for GRE encapsulated packets, even if the host has no GRE tunnel setup. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bgmac: fix typosHauke Mehrtens2014-01-071-2/+2
| | | | | | | | This fixes some typos found by Sergei. Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2014-01-0717-213/+483
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch Jesse Gross says: ==================== [GIT net-next] Open vSwitch Open vSwitch changes for net-next/3.14. Highlights are: * Performance improvements in the mechanism to get packets to userspace using memory mapped netlink and skb zero copy where appropriate. * Per-cpu flow stats in situations where flows are likely to be shared across CPUs. Standard flow stats are used in other situations to save memory and allocation time. * A handful of code cleanups and rationalization. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * ovs: make functions localStephen Hemminger2014-01-074-6/+7
| | | | | | | | | | | | | | | | Several functions and datastructures could be local Found with 'make namespacecheck' Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jesse Gross <jesse@nicira.com>
| * openvswitch: Compute checksum in skb_gso_segment() if neededThomas Graf2014-01-071-1/+1
| | | | | | | | | | | | | | | | | | The copy & csum optimization is no longer present with zerocopy enabled. Compute the checksum in skb_gso_segment() directly by dropping the HW CSUM capability from the features passed in. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jesse Gross <jesse@nicira.com>
| * openvswitch: Use skb_zerocopy() for upcallThomas Graf2014-01-071-8/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use of skb_zerocopy() can avoid the expensive call to memcpy() when copying the packet data into the Netlink skb. Completes checksum through skb_checksum_help() if not already done in GSO segmentation. Zerocopy is only performed if user space supported unaligned Netlink messages. memory mapped netlink i/o is preferred over zerocopy if it is set up. Cost of upcall is significantly reduced from: + 7.48% vhost-8471 [k] memcpy + 5.57% ovs-vswitchd [k] memcpy + 2.81% vhost-8471 [k] csum_partial_copy_generic to: + 5.72% ovs-vswitchd [k] memcpy + 3.32% vhost-5153 [k] memcpy + 0.68% vhost-5153 [k] skb_zerocopy (megaflows disabled) Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jesse Gross <jesse@nicira.com>
| * openvswitch: Pass datapath into userspace queue functionsThomas Graf2014-01-071-20/+14Star
| | | | | | | | | | | | | | | | Allows removing the net and dp_ifindex argument and simplify the code. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jesse Gross <jesse@nicira.com>
| * openvswitch: Drop user features if old user space attempted to create datapathThomas Graf2014-01-072-1/+30
| | | | | | | | | | | | | | | | | | Drop user features if an outdated user space instance that does not understand the concept of user_features attempted to create a new datapath. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jesse Gross <jesse@nicira.com>
| * openvswitch: Allow user space to announce ability to accept unaligned ↵Thomas Graf2014-01-073-0/+20
| | | | | | | | | | | | | | | | Netlink messages Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
| * net: Export skb_zerocopy() to zerocopy from one skb to anotherThomas Graf2014-01-073-55/+92
| | | | | | | | | | | | | | | | | | | | Make the skb zerocopy logic written for nfnetlink queue available for use by other modules. Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jesse Gross <jesse@nicira.com>