Commit message (Author, Age, Files, Lines)
* VSOCK: Add Makefile and Kconfig (Asias He, 2016-08-02, 4 files, -0/+44)
| | | | | | | | Enable virtio-vsock and vhost-vsock. Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* VSOCK: Introduce vhost_vsock.ko (Asias He, 2016-08-02, 3 files, -0/+729)
| | | | | | | | | VM sockets vhost transport implementation. This driver runs on the host. Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* VSOCK: Introduce virtio_transport.ko (Asias He, 2016-08-02, 2 files, -0/+625)
| | | | | | | | | VM sockets virtio transport implementation. This driver runs in the guest. Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* VSOCK: Introduce virtio_vsock_common.ko (Asias He, 2016-08-02, 8 files, -0/+1398)
| | | | | | | | | | This module contains the common code and header files for the following virtio_transport and vhost_vsock kernel modules. Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* VSOCK: defer sock removal to transports (Stefan Hajnoczi, 2016-08-02, 3 files, -6/+13)
| | | | | | | | | | | | The virtio transport will implement graceful shutdown and the related SO_LINGER socket option. This requires orphaning the sock but keeping it in the table of connections after .release(). This patch adds the vsock_remove_sock() function and leaves it up to the transport when to remove the sock. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* VSOCK: transport-specific vsock_transport functions (Stefan Hajnoczi, 2016-08-02, 2 files, -0/+12)
| | | | | | | | | | | | | | struct vsock_transport contains function pointers called by AF_VSOCK core code. The transport may want its own transport-specific function pointers and they can be added after struct vsock_transport. Allow the transport to fetch vsock_transport. It can downcast it to access transport-specific function pointers. The virtio transport will use this. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
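The downcast described above can be sketched in plain C. This is a minimal illustration only: `struct vsock_transport` here is a one-member stand-in, and `demo_transport`, `demo_container_of`, `demo_get_transport` and `demo_send` are hypothetical names, not the kernel's.

```c
#include <stddef.h>

/* Stand-in for the core ops table; the real struct has many more members. */
struct vsock_transport {
    int (*init)(void);
};

/* Hypothetical transport: core ops first, transport-specific extras after. */
struct demo_transport {
    struct vsock_transport transport;
    int (*send_pkt)(int len);           /* transport-specific function pointer */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static int demo_init(void) { return 0; }
static int demo_send_pkt(int len) { return len; }

static struct demo_transport demo = {
    .transport = { .init = demo_init },
    .send_pkt = demo_send_pkt,
};

/* What core code would hand back when asked for the registered transport. */
static struct vsock_transport *demo_get_transport(void)
{
    return &demo.transport;
}

/* Downcast the generic pointer to reach the extra function pointers. */
static int demo_send(int len)
{
    struct vsock_transport *t = demo_get_transport();
    struct demo_transport *dt =
        demo_container_of(t, struct demo_transport, transport);

    return dt->send_pkt(len);
}
```

Core code only ever sees the embedded generic table; the transport that registered it knows the enclosing type and can safely step back to it.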
* vhost: drop vringh dependency (Michael S. Tsirkin, 2016-08-02, 1 file, -2/+0)
| | | | | | | vringh isn't used by vhost net or scsi - it's used by CAIF only at the moment. Drop the dependency. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* vop: pull in vhost Kconfig (Michael S. Tsirkin, 2016-08-02, 1 file, -0/+4)
| | | | | | | VOP selects VHOST_RING. Pull in the Kconfig that includes it to make it self-contained. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* virtio: new feature to detect IOMMU device quirk (Michael S. Tsirkin, 2016-08-01, 3 files, -2/+36)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The interaction between virtio and IOMMUs is messy. On most systems with virtio, physical addresses match bus addresses, and it doesn't particularly matter which one we use to program the device. On some systems, including Xen and any system with a physical device that speaks virtio behind a physical IOMMU, we must program the IOMMU for virtio DMA to work at all. On other systems, including SPARC and PPC64, virtio-pci devices are enumerated as though they are behind an IOMMU, but the virtio host ignores the IOMMU, so we must either pretend that the IOMMU isn't there or somehow map everything as the identity. Add a feature bit to detect that quirk: VIRTIO_F_IOMMU_PLATFORM. Any device with this feature bit set to 0 needs a quirk and has to be passed physical addresses (as opposed to bus addresses) even though the device is behind an IOMMU. Note: it has to be a per-device quirk because for example, there could be a mix of passed-through and virtual virtio devices. As another example, some devices could be implemented by an out of process hypervisor backend (in case of qemu vhost, or vhost-user) and so support for an IOMMU needs to be coded up separately. It would be cleanest to handle this in IOMMU core code, but that needs per-device DMA ops. While we are waiting for that to be implemented, use a work-around in virtio core. Note: a "noiommu" feature is a quirk - add a wrapper to make that clear. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* balloon: check the number of available pages in leak balloon (Konstantin Neumoin, 2016-08-01, 1 file, -0/+2)
| | | | | | | | | | | | | | | | | | The balloon has a special mechanism that subscribes to the OOM notification and deflates the balloon by a fixed number of pages. The number is fixed even when the balloon is already fully deflated. But leak_balloon did not expect the number of pages to deflate to exceed the number taken, and raised a "BUG" in balloon_page_dequeue when the page list was empty. So the simplest solution is to check that the number of released pages is less than or equal to the number of taken pages. Cc: stable@vger.kernel.org Signed-off-by: Konstantin Neumoin <kneumoin@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
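The check described above amounts to clamping the request to what the balloon actually holds. A minimal sketch, assuming a hypothetical `demo_balloon` structure that tracks only a page count (the real driver keeps a page list):

```c
#include <stddef.h>

/* Hypothetical stand-in for the balloon bookkeeping. */
struct demo_balloon {
    size_t num_pages;   /* pages currently held by the balloon */
};

/* Release up to `want` pages, never more than the balloon holds. */
static size_t demo_leak_balloon(struct demo_balloon *b, size_t want)
{
    /* The added check: cap the request at the number of pages taken. */
    size_t num = want < b->num_pages ? want : b->num_pages;

    b->num_pages -= num;
    return num;
}
```

With the clamp in place, a fixed-size OOM request against an empty balloon releases nothing instead of walking off the end of the page list.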
* vhost: lockless enqueuing (Jason Wang, 2016-08-01, 2 files, -30/+29)
| | | | | | | | | | | | | | | We use a spinlock to synchronize the work list now, which may cause unnecessary contention. So this patch switches to llist to remove this contention. Pktgen tests show about 5% improvement: Before: ~1300000 pps After: ~1370000 pps Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
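The idea behind a lock-free work list can be sketched in userspace with C11 atomics. This is not the kernel's llist code; `demo_push`, `demo_take_all` and `demo_reverse` are hypothetical names that only mirror the general push, detach-all and reverse pattern such a list relies on.

```c
#include <stdatomic.h>
#include <stddef.h>

struct demo_node {
    struct demo_node *next;
    int id;
};

struct demo_list {
    _Atomic(struct demo_node *) first;
};

/* Producer: push with a compare-and-swap loop, no lock taken. */
static void demo_push(struct demo_list *l, struct demo_node *n)
{
    struct demo_node *old = atomic_load(&l->first);

    do {
        n->next = old;      /* on CAS failure `old` is refreshed, so retry */
    } while (!atomic_compare_exchange_weak(&l->first, &old, n));
}

/* Consumer: detach the whole list in one atomic exchange. */
static struct demo_node *demo_take_all(struct demo_list *l)
{
    return atomic_exchange(&l->first, NULL);
}

/* The detached list is newest-first; reverse it to run work in queue order. */
static struct demo_node *demo_reverse(struct demo_node *head)
{
    struct demo_node *prev = NULL;

    while (head) {
        struct demo_node *next = head->next;

        head->next = prev;
        prev = head;
        head = next;
    }
    return prev;
}
```

Producers never block each other, and the single consumer pays for one exchange per batch rather than one lock round trip per item.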
* vhost: simplify work flushing (Jason Wang, 2016-08-01, 1 file, -32/+21)
| | | | | | | | | | | We used to implement work flushing by tracking the queued seq, the done seq, and the number of flushes in progress. This patch simplifies this by implementing work flushing through another kind of vhost work with a completion. This will be used by the lockless enqueuing patch. Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* Linux 4.7 (Linus Torvalds, 2016-07-24, 1 file, -1/+1)
|
* Merge tag 'ceph-for-4.7-rc8' of git://github.com/ceph/ceph-client (Linus Torvalds, 2016-07-24, 1 file, -43/+113)
|\ | | | | | | | | | | | | | | | | | | | | | | Pull ceph fix from Ilya Dryomov: "A fix for a long-standing bug in the incremental osdmap handling code that caused misdirected requests, tagged for stable" The tag is signed with a brand new key - Sage is on vacation and I didn't anticipate this. * tag 'ceph-for-4.7-rc8' of git://github.com/ceph/ceph-client: libceph: apply new_state before new_up_client on incrementals
| * libceph: apply new_state before new_up_client on incrementals (Ilya Dryomov, 2016-07-22, 1 file, -43/+113)
| | Currently, osd_weight and osd_state fields are updated in the encoding
| | order. This is wrong, because an incremental map may look like e.g.
| |
| |     new_up_client: { osd=6, addr=... } # set osd_state and addr
| |     new_state: { osd=6, xorstate=EXISTS } # clear osd_state
| |
| | Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down). After
| | applying new_up_client, osd_state is changed to EXISTS | UP. Carrying
| | on with the new_state update, we flip EXISTS and leave osd6 in a weird
| | "!EXISTS but UP" state. A non-existent OSD is considered down by the
| | mapping code
| |
| |     for (i = 0; i < pg->pg_temp.len; i++) {
| |             if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
| |                     if (ceph_can_shift_osds(pi))
| |                             continue;
| |
| |                     temp->osds[temp->size++] = CRUSH_ITEM_NONE;
| |
| | and so requests get directed to the second OSD in the set instead of
| | the first, resulting in OSD-side errors like:
| |
| |     [WRN] : client.4239 192.168.122.21:0/2444980242 misdirected client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680
| |
| | and hung rbds on the client:
| |
| |     [ 493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
| |     [ 493.566805] rbd: rbd0: result -6 xferred 400000
| |     [ 493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688
| |
| | The fix is to decouple application from the decoding and:
| |
| | - apply new_weight first
| | - apply new_state before new_up_client
| | - twiddle osd_state flags if marking in
| | - clear out some of the state if osd is destroyed
| |
| | Fixes: http://tracker.ceph.com/issues/14901
| |
| | Cc: stable@vger.kernel.org # 3.15+: 6dd74e44dc1d: libceph: set 'exists' flag for newly up osd
| | Cc: stable@vger.kernel.org # 3.15+
| | Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| | Reviewed-by: Josh Durgin <jdurgin@redhat.com>
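The ordering problem in this commit message can be reproduced with two bit flags. The sketch below is illustrative only: the `DEMO_*` flag values and the helper names are assumptions, and it models just the osd_state part of the update (not weights or addresses).

```c
#include <stdint.h>

#define DEMO_EXISTS 0x1u    /* illustrative flag values */
#define DEMO_UP     0x2u

/* new_up_client: marks the OSD as existing and up. */
static uint32_t demo_apply_up_client(uint32_t state)
{
    return state | DEMO_EXISTS | DEMO_UP;
}

/* new_state: XORs the given bits into osd_state. */
static uint32_t demo_apply_new_state(uint32_t state, uint32_t xorstate)
{
    return state ^ xorstate;
}

/* Encoding order (the bug): new_up_client first, then the xor. */
static uint32_t demo_encoding_order(uint32_t state, uint32_t xorstate)
{
    return demo_apply_new_state(demo_apply_up_client(state), xorstate);
}

/* Fixed order: new_state first, then new_up_client. */
static uint32_t demo_fixed_order(uint32_t state, uint32_t xorstate)
{
    return demo_apply_up_client(demo_apply_new_state(state, xorstate));
}
```

Starting from EXISTS with xorstate=EXISTS, the encoding order ends in the "!EXISTS but UP" state from the message, while the fixed order ends in EXISTS | UP.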
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (Linus Torvalds, 2016-07-23, 64 files, -320/+850)
|\ \
| | | Pull networking fixes from David Miller:
| | |
| | |  1) Fix memory leak in nftables, from Liping Zhang.
| | |  2) Need to check result of vlan_insert_tag() in batman-adv otherwise
| | |     we risk NULL skb derefs, from Sven Eckelmann.
| | |  3) Check for dev_alloc_skb() failures in cfg80211, from Gregory
| | |     Greenman.
| | |  4) Handle properly when we have ppp_unregister_channel() happening in
| | |     parallel with ppp_connect_channel(), from WANG Cong.
| | |  5) Fix DCCP deadlock, from Eric Dumazet.
| | |  6) Bail out properly in UDP if sk_filter() truncates the packet to be
| | |     smaller than even the space that the protocol headers need. From
| | |     Michal Kubecek.
| | |  7) Similarly for rose, dccp, and sctp, from Willem de Bruijn.
| | |  8) Make TCP challenge ACKs less predictable, from Eric Dumazet.
| | |  9) Fix infinite loop in bgmac_dma_tx_add() from Florian Fainelli.
| | |
| | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
| | |   packet: propagate sock_cmsg_send() error
| | |   net/mlx5e: Fix del vxlan port command buffer memset
| | |   packet: fix second argument of sock_tx_timestamp()
| | |   net: switchdev: change ageing_time type to clock_t
| | |   Update maintainer for EHEA driver.
| | |   net/mlx4_en: Add resilience in low memory systems
| | |   net/mlx4_en: Move filters cleanup to a proper location
| | |   sctp: load transport header after sk_filter
| | |   net/sched/sch_htb: clamp xstats tokens to fit into 32-bit int
| | |   net: cavium: liquidio: Avoid dma_unmap_single on uninitialized ndata
| | |   net: nb8800: Fix SKB leak in nb8800_receive()
| | |   et131x: Fix logical vs bitwise check in et131x_tx_timeout()
| | |   vlan: use a valid default mtu value for vlan over macsec
| | |   net: bgmac: Fix infinite loop in bgmac_dma_tx_add()
| | |   mlxsw: spectrum: Prevent invalid ingress buffer mapping
| | |   mlxsw: spectrum: Prevent overwrite of DCB capability fields
| | |   mlxsw: spectrum: Don't emit errors when PFC is disabled
| | |   mlxsw: spectrum: Indicate support for autonegotiation
| | |   mlxsw: spectrum: Force link training according to admin state
| | |   r8152: add MODULE_VERSION
| | |   ...
| * | packet: propagate sock_cmsg_send() error (Soheil Hassas Yeganeh, 2016-07-22, 1 file, -3/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sock_cmsg_send() can return different error codes and not only -EINVAL, and we should properly propagate them. Fixes: c14ac9451c34 ("sock: enable timestamping using control messages") Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/mlx5e: Fix del vxlan port command buffer memset (Saeed Mahameed, 2016-07-21, 1 file, -2/+2)
| | | | | | | | | | | | | | | | | | | | | | | | memset the command buffers rather than the pointers to them. Fixes: b3f63c3d5e2c ("net/mlx5e: Add netdev support for VXLAN tunneling") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | packet: fix second argument of sock_tx_timestamp() (Yoshihiro Shimoda, 2016-07-20, 1 file, -3/+3)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes an issue where a syscall (e.g. the sendto syscall) does not work correctly. Since the sendto syscall doesn't have a msg_control buffer, sock_tx_timestamp() in packet_snd() cannot work correctly because sockc.tsflags is set to 0. So this patch sets sockc.tsflags to sk->sk_tsflags by default. Fixes: c14ac9451c34 ("sock: enable timestamping using control messages") Reported-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com> Reported-by: Keita Kobayashi <keita.kobayashi.ym@renesas.com> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: switchdev: change ageing_time type to clock_t (Vivien Didelot, 2016-07-20, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The switchdev value for the SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute is a clock_t and requires helpers such as clock_t_to_jiffies() to convert it to milliseconds. Change the ageing_time type from u32 to clock_t to make this explicit. Fixes: f55ac58ae64c ("switchdev: add bridge ageing_time attribute") Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Update maintainer for EHEA driver. (Douglas Miller, 2016-07-20, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since Thadeu left IBM, EHEA has gone mostly unmaintained, since his email address doesn't work anymore. I'm stepping up to help maintain this driver upstream. I'm adding Thadeu's personal e-mail address in Cc, hoping that we can get his ack. CC: Thadeu Lima de Souza Cascardo <cascardo@cascardo.eti.br> Signed-off-by: Douglas Miller <dougmill@linux.vnet.ibm.com> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@cascardo.eti.br> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge branch 'mlx4-fixes' (David S. Miller, 2016-07-20, 4 files, -40/+136)
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tariq Toukan says: ==================== Safe flow for mlx4_en configuration change This patchset improves the mlx4_en driver resiliency, especially on systems with low memory. Upon a configuration change that requires the allocation of new resources, we first try to allocate, prior to destroying the current ones. Once it is successfully done, we release the old resources and attach the new ones. Otherwise, we stay with a functioning interface having the same old configuration. This improvement became of greater significance after removing the use of vmap. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net/mlx4_en: Add resilience in low memory systems (Eugenia Emantayev, 2016-07-20, 3 files, -37/+132)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes the loss of the Ethernet port on low-memory systems, when the driver frees its resources and then fails to allocate new ones. The issue could happen while changing the number of channels, the ring size, or the timestamp configuration. This fix is necessary because of the removal of vmap use in the code. When vmap was in use, the driver could allocate non-contiguous memory and make it contiguous with vmap. Now it could fail to allocate a large chunk of contiguous memory and lose the port. The code now tries to allocate the new resources first and frees the old resources only upon success. Fixes: 73898db04301 ('net/mlx4: Avoid wrong virtual mappings') Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net/mlx4_en: Move filters cleanup to a proper location (Eugenia Emantayev, 2016-07-20, 2 files, -3/+4)
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | Filters cleanup should be done once before destroying net device, since filters list is contained in the private data. Fixes: 1eb8c695bda9 ('net/mlx4_en: Add accelerated RFS support') Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sctp: load transport header after sk_filter (Willem de Bruijn, 2016-07-19, 1 file, -4/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Do not cache pointers into the skb linear segment across sk_filter. The function call can trigger pskb_expand_head. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/sched/sch_htb: clamp xstats tokens to fit into 32-bit int (Konstantin Khlebnikov, 2016-07-19, 1 file, -2/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the kernel, HTB keeps tokens as signed 64-bit values in nanoseconds. In the netlink protocol these values are converted into psched ticks (64ns for now) and truncated to 32 bits. In struct tc_htb_xstats the fields "tokens" and "ctokens" are declared as unsigned 32-bit, but they can be negative, so the 'tc' tool prints them as signed. Big values lose their higher bits and/or become negative. This patch clamps tokens in xstats into the range from INT_MIN to INT_MAX. This makes it easier to understand what's going on. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
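The difference between truncating and clamping a 64-bit tick count can be sketched as follows. The function names are hypothetical, and the sketch assumes a 32-bit `int`, as on the platforms Linux supports.

```c
#include <limits.h>
#include <stdint.h>

/* Clamp a signed 64-bit tick count into the int range before it is
 * stored in a 32-bit statistics field. */
static int demo_clamp_tokens(int64_t ticks)
{
    if (ticks < INT_MIN)
        return INT_MIN;
    if (ticks > INT_MAX)
        return INT_MAX;
    return (int)ticks;
}

/* Plain truncation for comparison: only the low 32 bits survive. */
static uint32_t demo_truncate_tokens(int64_t ticks)
{
    return (uint32_t)ticks;
}
```

A value of 5000000000 ticks truncates to 705032704, which looks plausible but is wrong; clamping reports INT_MAX, which at least signals saturation.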
| * | net: cavium: liquidio: Avoid dma_unmap_single on uninitialized ndata (Florian Fainelli, 2016-07-17, 1 file, -4/+5)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The label lio_xmit_failed is used 3 times in liquidio_xmit(), but it always makes a call to dma_unmap_single() using potentially uninitialized fields of the "ndata" variable. Out of the 3 gotos, only 2 run after ndata has been initialized and after a prior dma_map_single() call. Fix this by adding a new error label, lio_xmit_dma_failed, which does this dma_unmap_single() and then proceeds with the lio_xmit_failed fallthrough. Fixes: f21fb3ed364bb ("Add support of Cavium Liquidio ethernet adapters") Reported-by: coverity (CID 1309740) Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: nb8800: Fix SKB leak in nb8800_receive() (Florian Fainelli, 2016-07-17, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case nb8800_receive() fails to allocate a fragment, we would leak the freshly allocated SKB and just return; free it instead. Reported-by: coverity (CID 1341750) Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Mans Rullgard <mans@mansr.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | et131x: Fix logical vs bitwise check in et131x_tx_timeout() (Florian Fainelli, 2016-07-17, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should be using a logical check here instead of a bitwise operation to check if the device is closed already in et131x_tx_timeout(). Reported-by: coverity (CID 146498) Fixes: 38df6492eb511 ("et131x: Add PCIe gigabit ethernet driver et131x to drivers/net") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
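The commit message does not quote the offending expression, so the following is only a sketch of one common form of this mistake: using bitwise NOT where logical NOT was intended when testing a flag. `DEMO_FLAG_IN_USE` and both helpers are hypothetical.

```c
#define DEMO_FLAG_IN_USE 0x10u   /* hypothetical "device is open" flag */

/* Bitwise NOT: ~(flags & bit) is nonzero for every input, so this
 * reports "closed" no matter what the flag says. */
static int demo_closed_bitwise(unsigned int flags)
{
    return ~(flags & DEMO_FLAG_IN_USE) ? 1 : 0;
}

/* Logical NOT: true only when the flag is actually clear. */
static int demo_closed_logical(unsigned int flags)
{
    return !(flags & DEMO_FLAG_IN_USE);
}
```

With the bitwise form an early-return guard fires unconditionally, so the code after it never runs.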
| * | vlan: use a valid default mtu value for vlan over macsec (Paolo Abeni, 2016-07-17, 3 files, -6/+18)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | macsec can't cope with mtu frames which need vlan tag insertion, and the vlan device sets its default mtu equal to the underlying dev's one. By default, vlan over macsec devices use an invalid mtu, dropping all the large packets. This patch adds a netif helper to check if an upper vlan device needs mtu reduction. The helper is used during vlan device initialization to set a valid default and during mtu updating to forbid invalid, too big, mtu values. The helper currently only checks if the lower dev is a macsec device; if we get more users, we need to update only the helper (possibly reserving an additional IFF bit). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: bgmac: Fix infinite loop in bgmac_dma_tx_add() (Florian Fainelli, 2016-07-16, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nothing is decrementing the index "i" while we are cleaning up the fragments we could not successfully transmit. Fixes: 9cde94506eacf ("bgmac: implement scatter/gather support") Reported-by: coverity (CID 1352048) Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge branch 'mlxsw-fixes' (David S. Miller, 2016-07-15, 4 files, -29/+26)
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jiri Pirko says: ==================== mlxsw: Couple of fixes Couple of fixes for mlxsw driver from Ido. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | mlxsw: spectrum: Prevent invalid ingress buffer mapping (Ido Schimmel, 2016-07-15, 3 files, -3/+19)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Packets entering the switch are mapped to a Switch Priority (SP) according to their PCP value (untagged frames are mapped to SP 0). The packets are classified to a priority group (PG) buffer in the port's headroom according to their SP. The switch maintains another mapping (SP to IEEE priority), which is used to generate PFC frames for lossless PGs. This mapping is initialized to IEEE = SP % 8. Therefore, when mapping SP 'x' to PG 'y' we create a situation in which an IEEE priority is mapped to two different PGs: IEEE 'x' ---> SP 'x' ---> PG 'y' IEEE 'x' ---> SP 'x + 8' ---> PG '0' (default) Which is invalid, as a flow can use only one PG buffer. Fix this by mapping both SP 'x' and 'x + 8' to the same PG buffer. Fixes: 8e8dfe9fdf06 ("mlxsw: spectrum: Add IEEE 802.1Qaz ETS support") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
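The fix described above, mapping both SP 'x' and SP 'x + 8' to the same PG, can be sketched with a small lookup table. The table size of 16 switch priorities is an assumption inferred from the "SP % 8" and "x + 8" wording, and all names are hypothetical.

```c
#define DEMO_IEEE_PRIO_COUNT 8
#define DEMO_SP_COUNT        16   /* assumed: two SPs per IEEE priority */

/* Map switch priority `ieee_prio` and its twin (ieee_prio + 8) to the
 * same PG buffer, so the IEEE priority resolves to exactly one PG. */
static void demo_map_prio_to_pg(unsigned char sp_to_pg[DEMO_SP_COUNT],
                                unsigned int ieee_prio, unsigned char pg)
{
    sp_to_pg[ieee_prio] = pg;
    sp_to_pg[ieee_prio + DEMO_IEEE_PRIO_COUNT] = pg;
}

/* Returns 1 when every IEEE priority maps to a single PG buffer. */
static int demo_mapping_is_valid(const unsigned char sp_to_pg[DEMO_SP_COUNT])
{
    for (unsigned int p = 0; p < DEMO_IEEE_PRIO_COUNT; p++)
        if (sp_to_pg[p] != sp_to_pg[p + DEMO_IEEE_PRIO_COUNT])
            return 0;
    return 1;
}
```

Updating only one of the two entries reproduces the invalid state from the message, where one IEEE priority points at two different PG buffers.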
| | * | mlxsw: spectrum: Prevent overwrite of DCB capability fields (Ido Schimmel, 2016-07-15, 1 file, -0/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The number of supported traffic classes that can have ETS and PFC simultaneously enabled is not subject to user configuration, so make sure we always initialize them to the correct values following a set operation. Fixes: 8e8dfe9fdf06 ("mlxsw: spectrum: Add IEEE 802.1Qaz ETS support") Fixes: d81a6bdb87ce ("mlxsw: spectrum: Add IEEE 802.1Qbb PFC support") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | mlxsw: spectrum: Don't emit errors when PFC is disabled (Ido Schimmel, 2016-07-15, 1 file, -1/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can't have PAUSE frames and PFC both enabled on the same port, but the fact that ieee_setpfc() was called doesn't necessarily mean PFC is enabled. Only emit errors when PAUSE frames and PFC are enabled simultaneously. Fixes: d81a6bdb87ce ("mlxsw: spectrum: Add IEEE 802.1Qbb PFC support") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | mlxsw: spectrum: Indicate support for autonegotiation (Ido Schimmel, 2016-07-15, 1 file, -1/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The device supports link autonegotiation, so let the user know about it by indicating support via ethtool ops. Fixes: 56ade8fe3fe1 ("mlxsw: spectrum: Add initial support for Spectrum ASIC") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | mlxsw: spectrum: Force link training according to admin state (Ido Schimmel, 2016-07-15, 1 file, -24/+1)
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When setting a new speed we need to disable and enable the port for the changes to take effect. We currently only do that if the operational state of the port is up. However, setting a new speed following link training failure will require us to explicitly set the port down and then up. Instead, disable and enable the port based on its administrative state. Fixes: 56ade8fe3fe1 ("mlxsw: spectrum: Add initial support for Spectrum ASIC") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue (David S. Miller, 2016-07-15, 4 files, -44/+66)
| |\ \
| | | | Jeff Kirsher says:
| | | |
| | | | ====================
| | | | Intel Wired LAN Driver Updates 2016-07-14
| | | |
| | | | This series contains fixes to i40e and ixgbe.
| | | |
| | | | Alex fixes issues found in i40e_rx_checksum() which was broken, where
| | | | the checksum was being returned valid when it was not.
| | | |
| | | | Kiran fixes a bug which was found when we abruptly remove a cable
| | | | which caused a panic. Set the VSI broadcast promiscuous mode during
| | | | VSI add sequence and prevents adding MAC filter if specified MAC
| | | | address is broadcast.
| | | |
| | | | Paolo Abeni fixes a bug by returning the actual work done, capped to
| | | | weight - 1, since the core doesn't allow to return the full budget
| | | | when the driver modifies the NAPI status.
| | | |
| | | | Guilherme Piccoli fixes an issue where the q_vector initialization
| | | | routine sets the affinity_mask of a q_vector based on v_idx value.
| | | | This means a loop iterates on v_idx, which is an incremental value,
| | | | and the cpumask is created based on this value. This is a problem in
| | | | systems with multiple logical CPUs per core (like in SMT scenarios).
| | | | Changed the way q_vector's affinity_mask is created to resolve the
| | | | issue.
| | | | ====================
| | | |
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | i40e: use valid online CPU on q_vector initialization (Guilherme G. Piccoli, 2016-07-15, 1 file, -5/+11)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the q_vector initialization routine sets the affinity_mask of a q_vector based on the v_idx value. That is, a loop iterates on v_idx, which is an incremental value, and the cpumask is created based on this value. This is a problem in systems with multiple logical CPUs per core (like in SMT scenarios). If we disable some logical CPUs, by turning SMT off for example, we will end up with a sparse cpu_online_mask, i.e., only the first CPU in a core is online, and incremental filling in q_vector cpumask might lead to multiple offline CPUs being assigned to q_vectors. Example: if we have a system with 8 cores each one containing 8 logical CPUs (SMT == 8 in this case), we have 64 CPUs in total. But if SMT is disabled, only the 1st CPU in each core remains online, so the cpu_online_mask in this case would have only 8 bits set, in a sparse way. In the general case, when SMT is off the cpu_online_mask has only C bits set: 0, 1*N, 2*N, ..., (C-1)*N where C == # of cores; N == # of logical CPUs per core. In our example, only bits 0, 8, 16, 24, 32, 40, 48, 56 would be set. This patch changes the way q_vector's affinity_mask is created: it iterates on v_idx, but consumes the CPU index from the cpu_online_mask instead of just using the v_idx incremental value. No functional changes were introduced. Signed-off-by: Guilherme G Piccoli <gpiccoli@linux.vnet.ibm.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
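Picking CPUs from a sparse online mask rather than from the vector index can be sketched with a 64-bit bitmap. This is a userspace illustration with hypothetical names; wrapping back to the first online CPU when vectors outnumber CPUs is an assumption of the sketch, not something the commit message states.

```c
#include <stdint.h>

/* Lowest set bit index strictly above `prev` in `online`, or -1 if none.
 * Pass prev = -1 to start from bit 0. */
static int demo_next_online(uint64_t online, int prev)
{
    for (int cpu = prev + 1; cpu < 64; cpu++)
        if (online & (UINT64_C(1) << cpu))
            return cpu;
    return -1;
}

/* Give each vector an online CPU, walking the mask instead of using the
 * vector index directly. Wraps around when vectors outnumber CPUs. */
static void demo_assign_cpus(uint64_t online, int *cpu_of_vector,
                             int num_vectors)
{
    int cpu = -1;

    for (int v = 0; v < num_vectors; v++) {
        cpu = demo_next_online(online, cpu);
        if (cpu < 0)
            cpu = demo_next_online(online, -1);   /* wrap (assumed policy) */
        cpu_of_vector[v] = cpu;
    }
}
```

With the 8-core, SMT-off mask from the example (bits 0, 8, ..., 56), vectors 0 through 7 land on CPUs 0, 8, ..., 56 rather than on the offline CPUs 1 through 7.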
| | * | ixgbe: napi_poll must return the work done (Paolo Abeni, 2016-07-15, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the function ixgbe_poll() returns 0 when it completely cleans the rx rings, but this fouls up budget accounting in core code. Fix this by returning the actual work done, capped to weight - 1, since the core doesn't allow returning the full budget when the driver modifies the napi status. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Venkatesh Srinivas <venkateshs@google.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
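The return-value rule stated above can be written as a small helper. This is a sketch with a hypothetical function name; returning the full budget when cleaning did not finish is an assumption about the surrounding poll logic, which the message does not spell out.

```c
/* Value a poll routine reports back to the core.
 * work_done:      packets processed in this poll
 * budget:         the weight the core allowed
 * clean_complete: nonzero when the rings were fully cleaned and the
 *                 driver itself completed NAPI */
static int demo_poll_return(int work_done, int budget, int clean_complete)
{
    if (!clean_complete)
        return budget;      /* assumed: more work pending, poll again */

    /* Report real work done, but never the full budget. */
    return work_done < budget - 1 ? work_done : budget - 1;
}
```

Returning 0 after a completed clean hides the work from the core's accounting, while returning the full budget would tell the core that polling must continue.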
| | * | i40e: enable VSI broadcast promiscuous mode instead of adding broadcast filter (Kiran Patil, 2016-07-15, 1 file, -12/+20)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch sets VSI broadcast promiscuous mode during VSI add sequence and prevents adding MAC filter if specified MAC address is broadcast. Change-ID: Ia62251fca095bc449d0497fc44bec3a5a0136773 Signed-off-by: Kiran Patil <kiran.patil@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| | * | i40e/i40evf: Fix i40e_rx_checksum (Alexander Duyck, 2016-07-15, 2 files, -26/+34)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are a couple of issues I found in i40e_rx_checksum while doing some recent testing. As a result I have found the Rx checksum logic is pretty much broken and returning that the checksum is valid for tunnels in cases where it is not. First the inner types are not the correct values to use to test for if a tunnel is present or not. In addition the inner protocol types are not a bitmask as such performing an OR of the values doesn't make sense. I have instead changed the code so that the inner protocol types are used to determine if we report CHECKSUM_UNNECESSARY or not. For anything that does not end in UDP, TCP, or SCTP it doesn't make much sense to report a checksum offload since it won't contain a checksum anyway. This leaves us with the need to set the csum_level based on some value. For that purpose I am using the tunnel_type field. If the tunnel type is GRENAT or greater then this means we have a GRE or UDP tunnel with an inner header. In the case of GRE or UDP we will have a possible checksum present so for this reason it should be safe to set the csum_level to 1 to indicate that we are reporting the state of the inner header. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * | | r8152: add MODULE_VERSION (Grant Grundler, 2016-07-15, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ethtool -i provides a driver version that is hard coded. Export the same value via "modinfo". Signed-off-by: Grant Grundler <grundler@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | tcp: enable per-socket rate limiting of all 'challenge acks' (Jason Baron, 2016-07-15, 1 file, -17/+22)
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The per-socket rate limit for 'challenge acks' was introduced in the context of limiting ack loops: commit f2b2c582e824 ("tcp: mitigate ACK loops for connections as tcp_sock") And I think it can be extended to rate limit all 'challenge acks' on a per-socket basis. Since we have the global tcp_challenge_ack_limit, this patch allows for tcp_challenge_ack_limit to be set to a large value and effectively rely on the per-socket limit, or set tcp_challenge_ack_limit to a lower value and still prevent a single connection from consuming the entire challenge ack quota. It further moves in the direction of eliminating the global limit at some point, as Eric Dumazet has suggested. This is a follow-up to: Subject: tcp: make challenge acks less predictable Cc: Eric Dumazet <edumazet@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Yue Cao <ycao009@ucr.edu> Signed-off-by: Jason Baron <jbaron@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | bonding: set carrier off for devices created through netlink  Beniamino Galvani  2016-07-15  1  -1/+5
| | |
| | | Commit e826eafa65c6 ("bonding: Call netif_carrier_off after
| | | register_netdevice") moved netif_carrier_off() from bond_init() to
| | | bond_create(), but the latter is called only for initial default
| | | devices and ones created through sysfs:
| | |
| | |  $ modprobe bonding
| | |  $ echo +bond1 > /sys/class/net/bonding_masters
| | |  $ ip link add bond2 type bond
| | |  $ grep "MII Status" /proc/net/bonding/*
| | |  /proc/net/bonding/bond0:MII Status: down
| | |  /proc/net/bonding/bond1:MII Status: down
| | |  /proc/net/bonding/bond2:MII Status: up
| | |
| | | Ensure that carrier is initially off also for devices created through
| | | netlink.
| | |
| | | Signed-off-by: Beniamino Galvani <bgalvani@redhat.com>
| | | Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge branch 'sk_filter-trim-limit'  David S. Miller  2016-07-13  7  -13/+25
| |\ \
| | | | Willem de Bruijn says:
| | | |
| | | | ====================
| | | | limit sk_filter trim to payload
| | | |
| | | | Sockets can apply a filter to incoming packets to drop or trim them.
| | | | Fix two codepaths that call skb_pull/__skb_pull after sk_filter
| | | | without checking for packet length.
| | | |
| | | | Reading beyond skb->tail after trimming happens in more codepaths, but
| | | | safety of reading in the linear segment is based on minimum allocation
| | | | size (MAX_HEADER, GRO_MAX_HEAD, ..).
| | | | ====================
| | | |
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | dccp: limit sk_filter trim to payload  Willem de Bruijn  2016-07-13  4  -6/+13
| | | |
| | | | Dccp verifies packet integrity, including length, at initial rcv in
| | | | dccp_invalid_packet, later pulls headers in dccp_enqueue_skb.
| | | |
| | | | A call to sk_filter in-between can cause __skb_pull to wrap skb->len.
| | | | skb_copy_datagram_msg interprets this as a negative value, so
| | | | (correctly) fails with EFAULT. The negative length is reported in
| | | | ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.
| | | |
| | | | Introduce an sk_receive_skb variant that caps how small a filter
| | | | program can trim packets, and call this in dccp with the header
| | | | length. Excessively trimmed packets are now processed normally and
| | | | queued for reception as 0B payloads.
| | | |
| | | | Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
| | | | Signed-off-by: Willem de Bruijn <willemb@google.com>
| | | | Acked-by: Daniel Borkmann <daniel@iogearbox.net>
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | rose: limit sk_filter trim to payload  Willem de Bruijn  2016-07-13  3  -7/+12
| |/ /
| | | Sockets can have a filter program attached that drops or trims
| | | incoming packets based on the filter program return value.
| | |
| | | Rose requires data packets to have at least ROSE_MIN_LEN bytes. It
| | | verifies this on arrival in rose_route_frame and unconditionally pulls
| | | the bytes in rose_recvmsg. The filter can trim packets to below this
| | | value in-between, causing pull to fail, leaving the partial header at
| | | the time of skb_copy_datagram_msg.
| | |
| | | Place a lower bound on the size to which sk_filter may trim packets
| | | by introducing sk_filter_trim_cap and call this for rose packets.
| | |
| | | Signed-off-by: Willem de Bruijn <willemb@google.com>
| | | Acked-by: Daniel Borkmann <daniel@iogearbox.net>
| | | Signed-off-by: David S. Miller <davem@davemloft.net>
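The lower bound that the dccp and rose messages above describe can be sketched on plain lengths. This is a minimal model assuming the filter returns 0 to drop and otherwise the number of bytes to keep; `trim_with_cap()` is a hypothetical helper that works on integers, not the kernel's sk_filter_trim_cap, which operates on a socket and an skb.

```c
#include <stddef.h>

/*
 * Returns the packet length after filtering, or -1 if the filter dropped
 * the packet. "cap" is the smallest length the filter may trim down to,
 * for example the protocol header length.
 */
static long trim_with_cap(size_t pkt_len, size_t filter_ret, size_t cap)
{
	size_t keep;

	if (filter_ret == 0)
		return -1;  /* filter asked to drop the packet */

	/* Never trim below the cap, whatever the filter returned. */
	keep = filter_ret > cap ? filter_ret : cap;

	/* Trimming can only shrink a packet, never grow it. */
	if (keep > pkt_len)
		keep = pkt_len;

	return (long)keep;
}
```

With a cap equal to the header length, a later unconditional header pull always finds enough bytes, provided the packet passed its length check on arrival; an excessively trimmed packet is simply left with an empty payload.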
| * | Merge branch 'mlx5-fixes'  David S. Miller  2016-07-13  1  -1/+12
| |\ \
| | | | Saeed Mahameed says:
| | | |
| | | | ====================
| | | | mlx5 tx timeout watchdog fixes
| | | |
| | | | This patch set provides two trivial fixes for the tx timeout series
| | | | lately applied into net 4.7.
| | | |
| | | | From Daniel, detect stuck queues due to BQL
| | | | From Mohamad, fix tx timeout watchdog false alarm
| | | |
| | | | Hopefully those two fixes will make it to -stable, assuming
| | | | 3947ca185999 ('net/mlx5e: Implement ndo_tx_timeout callback') was also
| | | | backported to -stable.
| | | | ====================
| | | |
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net/mlx5e: start/stop all tx queues upon open/close netdev  Mohamad Haj Yahia  2016-07-13  1  -0/+11
| | | |
| | | | Start all tx queues (including inactive ones) when opening the netdev.
| | | | Stop all tx queues (including inactive ones) when closing the netdev.
| | | |
| | | | This is a workaround for the tx timeout watchdog false alarm issue in
| | | | which the netdev watchdog is polling all the tx queues which may
| | | | include inactive queues and thus once lowering the real tx queues
| | | | number (ethtool -L) it will generate tx timeout watchdog false alarms.
| | | |
| | | | Fixes: 3947ca185999 ('net/mlx5e: Implement ndo_tx_timeout callback')
| | | | Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
| | | | Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>