summaryrefslogtreecommitdiffstats
path: root/net/core
Commit message (Collapse)AuthorAgeFilesLines
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-07-233-14/+57
|\ | | | | | | | | | | | | | | | | | | Conflicts: net/bridge/br_mdb.c br_mdb.c conflict was a function call being removed to fix a bug in 'net' but whose signature was changed in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: ratelimit warnings about dst entry refcount underflow or overflowKonstantin Khlebnikov2015-07-211-1/+3
| | | | | | | | | | | | | | | | | | | | | | Kernel generates a lot of warnings when dst entry reference counter overflows and becomes negative. That bug was seen several times at machines with outdated 3.10.y kernels. Most like it's already fixed in upstream. Anyway that flood completely kills machine and makes further debugging impossible. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Fix skb csum races when peekingHerbert Xu2015-07-161-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we calculate the checksum on the recv path, we store the result in the skb as an optimisation in case we need the checksum again down the line. This is in fact bogus for the MSG_PEEK case as this is done without any locking. So multiple threads can peek and then store the result to the same skb, potentially resulting in bogus skb states. This patch fixes this by only storing the result if the skb is not shared. This preserves the optimisations for the few cases where it can be done safely due to locking or other reasons, e.g., SIOCINQ. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Clone skb before setting peeked flagHerbert Xu2015-07-161-3/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Shared skbs must not be modified and this is crucial for broadcast and/or multicast paths where we use it as an optimisation to avoid unnecessary cloning. The function skb_recv_datagram breaks this rule by setting peeked without cloning the skb first. This causes funky races which leads to double-free. This patch fixes this by cloning the skb and replacing the skb in the list when setting skb->peeked. Fixes: a59322be07c9 ("[UDP]: Only increment counter on first peek/recv") Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
| * rtnetlink: reject non-IFLA_VF_PORT attributes inside IFLA_VF_PORTSDaniel Borkmann2015-07-161-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similarly as in commit 4f7d2cdfdde7 ("rtnetlink: verify IFLA_VF_INFO attributes before passing them to driver"), we have a double nesting of netlink attributes, i.e. IFLA_VF_PORTS only contains IFLA_VF_PORT that is nested itself. While IFLA_VF_PORTS is a verified attribute from ifla_policy[], we only check if the IFLA_VF_PORTS container has IFLA_VF_PORT attributes and then pass the attribute's content itself via nla_parse_nested(). It would be more correct to reject inner types other than IFLA_VF_PORT instead of continuing parsing and also similarly as in commit 4f7d2cdfdde7, to check for a minimum of NLA_HDRLEN. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Roopa Prabhu <roopa@cumulusnetworks.com> Cc: Scott Feldman <sfeldma@gmail.com> Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | openvswitch: Use regular VXLAN net_device deviceThomas Graf2015-07-211-9/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This gets rid of all OVS specific VXLAN code in the receive and transmit path by using a VXLAN net_device to represent the vport. Only a small shim layer remains which takes care of handling the VXLAN specific OVS Netlink configuration. Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb() since they are no longer needed. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | fib: Add fib rule match on tunnel idThomas Graf2015-07-211-2/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This add the ability to select a routing table based on the tunnel id which allows to maintain separate routing tables for each virtual tunnel network. ip rule add from all tunnel-id 100 lookup 100 ip rule add from all tunnel-id 200 lookup 200 A new static key controls the collection of metadata at tunnel level upon demand. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* | dst: Metadata destinationsThomas Graf2015-07-212-15/+71
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduces a new dst_metadata which enables to carry per packet metadata between forwarding and processing elements via the skb->dst pointer. The structure is set up to be a union. Thus, each separate type of metadata requires its own dst instance. If demand arises to carry multiple types of metadata concurrently, metadata dst entries can be made stackable. The metadata dst entry is refcnt'ed as expected for now but a non reference counted use is possible if the reference is forced before queueing the skb. In order to allow allocating dsts with variable length, the existing dst_alloc() is split into a dst_alloc() and dst_init() function. The existing dst_init() function to initialize the subsystem is being renamed to dst_subsys_init() to make it clear what is what. The check before ip_route_input() is changed to ignore metadata dsts and drop the dst inside the routing function thus allowing to interpret metadata in a later commit. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* | lwtunnel: support dst output redirect functionRoopa Prabhu2015-07-211-0/+56
| | | | | | | | | | | | | | | | | | | | | | | | This patch introduces lwtunnel_output function to call corresponding lwtunnels output function to xmit the packet. It adds two variants lwtunnel_output and lwtunnel_output6 for ipv4 and ipv6 respectively today. But this is subject to change when lwtstate will reside in dst or dst_metadata (as per upstream discussions). Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | lwtunnel: infrastructure for handling light weight tunnels like mplsRoopa Prabhu2015-07-212-0/+180
| | | | | | | | | | | | | | | | | | | | | | | | Provides infrastructure to parse/dump/store encap information for light weight tunnels like mpls. Encap information for such tunnels is associated with fib routes. This infrastructure is based on previous suggestions from Eric Biederman to follow the xfrm infrastructure. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | test_bpf: add bpf_skb_vlan_push/pop() testsAlexei Starovoitov2015-07-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | improve accuracy of timing in test_bpf and add two stress tests: - {skb->data[0], get_smp_processor_id} repeated 2k times - {skb->data[0], vlan_push} x 68 followed by {skb->data[0], vlan_pop} x 68 1st test is useful to test performance of JIT implementation of BPF_LD_ABS together with BPF_CALL instructions. 2nd test is stressing skb_vlan_push/pop logic together with skb->data access via BPF_LD_ABS insn which checks that re-caching of skb->data is done correctly. In order to call bpf_skb_vlan_push() from test_bpf.ko have to add three export_symbol_gpl. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | bpf: introduce bpf_skb_vlan_push/pop() helpersAlexei Starovoitov2015-07-211-0/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow eBPF programs attached to TC qdiscs call skb_vlan_push/pop via helper functions. These functions may change skb->data/hlen which are cached by some JITs to improve performance of ld_abs/ld_ind instructions. Therefore JITs need to recognize bpf_skb_vlan_push/pop() calls, re-compute header len and re-cache skb->data/hlen back into cpu registers. Note, skb->data/hlen are not directly accessible from the programs, so any changes to skb->data done either by these helpers or by other TC actions are safe. eBPF JIT supported by three architectures: - arm64 JIT is using bpf_load_pointer() without caching, so it's ok as-is. - x64 JIT re-caches skb->data/hlen unconditionally after vlan_push/pop calls (experiments showed that conditional re-caching is slower). - s390 JIT falls back to interpreter for now when bpf_skb_vlan_push() is present in the program (re-caching is tbd). These helpers allow more scalable handling of vlan from the programs. Instead of creating thousands of vlan netdevs on top of eth0 and attaching TC+ingress+bpf to all of them, the program can be attached to eth0 directly and manipulate vlans as necessary. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: don't reforward packets already forwarded by offload deviceScott Feldman2015-07-211-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just before queuing skb for xmit on port, check if skb has been marked by switchdev port driver as already fordwarded by device. If so, drop skb. A non-zero skb->offload_fwd_mark field is set by the switchdev port driver/device on ingress to indicate the skb has already been forwarded by the device to egress ports with matching dev->skb_mark. The switchdev port driver would assign a non-zero dev->offload_skb_mark for each device port netdev during registration, for example. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ebpf: add helper to retrieve net_cls's classid cookieDaniel Borkmann2015-07-201-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It would be very useful to retrieve the net_cls's classid from an eBPF program to allow for a more fine-grained classification, it could be directly used or in conjunction with additional policies. I.e. docker, but also tooling such as cgexec, can easily run applications via net_cls cgroups: cgcreate -g net_cls:/foo echo 42 > foo/net_cls.classid cgexec -g net_cls:foo <prog> Thus, their respecitve classid cookie of foo can then be looked up on the egress path to apply further policies. The helper is desigend such that a non-zero value returns the cgroup id. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netlink: changes for setting and clearing protodown via netlink.Anuradha Karuppiah2015-07-161-2/+14
| | | | | | | | | | | | | | | | Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net core: Add protodown support.Anuradha Karuppiah2015-07-162-0/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces the proto_down flag that can be used by user space applications to notify switch drivers that errors have been detected on the device. The switch driver can react to protodown notification by doing a phys down on the associated switch port. Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-07-144-127/+127
|\| | | | | | | | | | | | | | | | | | | | | Conflicts: net/bridge/br_mdb.c Minor conflict in br_mdb.c, in 'net' we added a memset of the on-stack 'ip' variable whereas in 'net-next' we assign a new member 'vid'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: call rcu_read_lock early in process_backlogJulian Anastasov2015-07-111-16/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Incoming packet should be either in backlog queue or in RCU read-side section. Otherwise, the final sequence of flush_backlog() and synchronize_net() may miss packets that can run without device reference: CPU 1 CPU 2 skb->dev: no reference process_backlog:__skb_dequeue process_backlog:local_irq_enable on_each_cpu for flush_backlog => IPI(hardirq): flush_backlog - packet not found in backlog CPU delayed ... synchronize_net - no ongoing RCU read-side sections netdev_run_todo, rcu_barrier: no ongoing callbacks __netif_receive_skb_core:rcu_read_lock - too late free dev process packet for freed dev Fixes: 6e583ce5242f ("net: eliminate refcounting in backlog queue") Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: do not process device backlog during unregistrationJulian Anastasov2015-07-111-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 381c759d9916 ("ipv4: Avoid crashing in ip_error") fixes a problem where processed packet comes from device with destroyed inetdev (dev->ip_ptr). This is not expected because inetdev_destroy is called in NETDEV_UNREGISTER phase and packets should not be processed after dev_close_many() and synchronize_net(). Above fix is still required because inetdev_destroy can be called for other reasons. But it shows the real problem: backlog can keep packets for long time and they do not hold reference to device. Such packets are then delivered to upper levels at the same time when device is unregistered. Calling flush_backlog after NETDEV_UNREGISTER_FINAL still accounts all packets from backlog but before that some packets continue to be delivered to upper levels long after the synchronize_net call which is supposed to wait the last ones. Also, as Eric pointed out, processed packets, mostly from other devices, can continue to add new packets to backlog. Fix the problem by moving flush_backlog early, after the device driver is stopped and before the synchronize_net() call. Then use netif_running check to make sure we do not add more packets to backlog. We have to do it in enqueue_to_backlog context when the local IRQ is disabled. As result, after the flush_backlog and synchronize_net sequence all packets should be accounted. Thanks to Eric W. Biederman for the test script and his valuable feedback! Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Fixes: 6e583ce5242f ("net: eliminate refcounting in backlog queue") Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: pktgen: kill the "Wait for kthread_stop" code in pktgen_thread_worker()Oleg Nesterov2015-07-101-9/+2Star
| | | | | | | | | | | | | | | | | | | | pktgen_thread_worker() doesn't need to wait for kthread_stop(), it can simply exit. Just pktgen_create_thread() and pg_net_exit() should do get_task_struct()/put_task_struct(). kthread_stop(dead_thread) is fine. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: pktgen: fix race between pktgen_thread_worker() and kthread_stop()Oleg Nesterov2015-07-101-1/+3
| | | | | | | | | | | | | | | | | | | | pktgen_thread_worker() is obviously racy, kthread_stop() can come between the kthread_should_stop() check and set_current_state(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Jan Stancek <jstancek@redhat.com> Reported-by: Marcelo Leitner <mleitner@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * rtnetlink: verify IFLA_VF_INFO attributes before passing them to driverDaniel Borkmann2015-07-091-91/+96
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason Gunthorpe reported that since commit c02db8c6290b ("rtnetlink: make SR-IOV VF interface symmetric"), we don't verify IFLA_VF_INFO attributes anymore with respect to their policy, that is, ifla_vfinfo_policy[]. Before, they were part of ifla_policy[], but they have been nested since placed under IFLA_VFINFO_LIST, that contains the attribute IFLA_VF_INFO, which is another nested attribute for the actual VF attributes such as IFLA_VF_MAC, IFLA_VF_VLAN, etc. Despite the policy being split out from ifla_policy[] in this commit, it's never applied anywhere. nla_for_each_nested() only does basic nla_ok() testing for struct nlattr, but it doesn't know about the data context and their requirements. Fix, on top of Jason's initial work, does 1) parsing of the attributes with the right policy, and 2) using the resulting parsed attribute table from 1) instead of the nla_for_each_nested() loop (just like we used to do when still part of ifla_policy[]). Reference: http://thread.gmane.org/gmane.linux.network/368913 Fixes: c02db8c6290b ("rtnetlink: make SR-IOV VF interface symmetric") Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com> Cc: Greg Rose <gregory.v.rose@intel.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Rony Efraim <ronye@mellanox.com> Cc: Vlad Zolotarov <vladz@cloudius-systems.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Vlad Zolotarov <vladz@cloudius-systems.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Revert "dev: set iflink to 0 for virtual interfaces"Nicolas Dichtel2015-07-091-4/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | This reverts commit e1622baf54df8cc958bf29d71de5ad545ea7d93c. The side effect of this commit is to add a '@NONE' after each virtual interface name with a 'ip link'. It may break existing scripts. Reported-by: Olivier Hartkopp <socketcan@hartkopp.net> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Tested-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: graceful exit from netif_alloc_netdev_queues()Eric Dumazet2015-07-091-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | User space can crash kernel with ip link add ifb10 numtxqueues 100000 type ifb We must replace a BUG_ON() by proper test and return -EINVAL for crazy values. Fixes: 60877a32bce00 ("net: allow large number of tx queues") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net_sched: gen_estimator: extend pps limitEric Dumazet2015-07-081-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rate estimators are limited to 4 Mpps, which was fine years ago, but too small with current hardware generation. Lets use 2^5 scaling instead of 2^10 to get 128 Mpps new limit. On 64bit arch, use an "unsigned long" for temp storage and remove limit. (We do not expect 32bit arches to be able to reach this point) Tested: tc -s -d filter sh dev eth0 parent ffff: filter protocol ip pref 1 u32 filter protocol ip pref 1 u32 fh 800: ht divisor 1 filter protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:15 match 07000000/ff000000 at 12 action order 1: gact action drop random type none pass val 0 index 1 ref 1 bind 1 installed 166 sec Action statistics: Sent 39734251496 bytes 863788076 pkt (dropped 863788117, overlimits 0 requeues 0) rate 4067Mbit 11053596pps backlog 0b 0p requeues 0 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: skb_defer_rx_timestamp should check for phydev before setting up classifyAlexander Duyck2015-07-091-1/+5
| | | | | | | | | | | | | | | | | | | | | | This change makes it so that the call skb_defer_rx_timestamp will first check for a phydev before going in and manipulating the skb->data and skb->len values. By doing this we can avoid unnecessary work on network devices that don't support phydev. As a result we reduce the total instruction count needed to process this on most devices. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: extend percpu stats helpersEric Dumazet2015-07-081-2/+2
|/ | | | | | | | | | | | | | | | | | qdisc_bstats_update_cpu() and other helpers were added to support percpu stats for qdisc. We want to add percpu stats for tc action, so this patch add common helpers. qdisc_bstats_update_cpu() is renamed to qdisc_bstats_cpu_update() qdisc_qstats_drop_cpu() is renamed to qdisc_qstats_cpu_drop() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sock_diag: don't broadcast kernel socketsCraig Gallek2015-06-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kernel sockets do not hold a reference for the network namespace to which they point. Socket destruction broadcasting relies on the network namespace and will cause the splat below when a kernel socket is destroyed. This fix simply ignores kernel sockets when they are destroyed. Reported as: general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 1 PID: 9130 Comm: kworker/1:1 Not tainted 4.1.0-gelk-debug+ #1 Workqueue: sock_diag_events sock_diag_broadcast_destroy_work Stack: ffff8800b9c586c0 ffff8800b9c586c0 ffff8800ac4692c0 ffff8800936d4a90 ffff8800352efd38 ffffffff8469a93e ffff8800352efd98 ffffffffc09b9b90 ffff8800352efd78 ffff8800ac4692c0 ffff8800b9c586c0 ffff8800831b6ab8 Call Trace: [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10 [<ffffffffc09b9b90>] ? inet_diag_handler_get_info+0x110/0x1fb [inet_diag] [<ffffffff845c868d>] netlink_broadcast+0x1d/0x20 [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10 [<ffffffff845b2bf5>] sock_diag_broadcast_destroy_work+0xd5/0x160 [<ffffffff8408ea97>] process_one_work+0x147/0x420 [<ffffffff8408f0f9>] worker_thread+0x69/0x470 [<ffffffff8409fda3>] ? preempt_count_sub+0xa3/0xf0 [<ffffffff8408f090>] ? rescuer_thread+0x320/0x320 [<ffffffff84093cd7>] kthread+0x107/0x120 [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0 [<ffffffff8469d31f>] ret_from_fork+0x3f/0x70 [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0 Tested: Using a debug kernel while 'ss -E' is running: ip netns add test-ns ip netns delete test-ns Fixes: eb4cb008529c sock_diag: define destruction multicast groups Fixes: 26abe14379f8 net: Modify sk_alloc to not reference count the netns of kernel sockets. Reported-by: Dave Jones <davej@codemonkey.org.uk> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Kill sock->sk_protinfoDavid Miller2015-06-291-1/+0Star
| | | | | | No more users, so it can now be removed. Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Pre-initialize ip_proto in __skb_flow_dissect()Geert Uytterhoeven2015-06-291-1/+1
| | | | | | | | | net/core/flow_dissector.c: In function ‘__skb_flow_dissect’: net/core/flow_dissector.c:132: warning: ‘ip_proto’ may be used uninitialized in this function Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2015-06-2516-554/+1498
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: 1) Add TX fast path in mac80211, from Johannes Berg. 2) Add TSO/GRO support to ibmveth, from Thomas Falcon 3) Move away from cached routes in ipv6, just like ipv4, from Martin KaFai Lau. 4) Lots of new rhashtable tests, from Thomas Graf. 5) Run ingress qdisc lockless, from Alexei Starovoitov. 6) Allow servers to fetch TCP packet headers for SYN packets of new connections, for fingerprinting. From Eric Dumazet. 7) Add mode parameter to pktgen, for testing receive. From Alexei Starovoitov. 8) Cache access optimizations via simplifications of build_skb(), from Alexander Duyck. 9) Move page frag allocator under mm/, also from Alexander. 10) Add xmit_more support to hv_netvsc, from KY Srinivasan. 11) Add a counter guard in case we try to perform endless reclassify loops in the packet scheduler. 12) Extern flow dissector to be programmable and use it in new "Flower" classifier. From Jiri Pirko. 13) AF_PACKET fanout rollover fixes, performance improvements, and new statistics. From Willem de Bruijn. 14) Add netdev driver for GENEVE tunnels, from John W Linville. 15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso. 16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet. 17) Add an ECN retry fallback for the initial TCP handshake, from Daniel Borkmann. 18) Add tail call support to BPF, from Alexei Starovoitov. 19) Add several pktgen helper scripts, from Jesper Dangaard Brouer. 20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa. 21) Favor even port numbers for allocation to connect() requests, and odd port numbers for bind(0), in an effort to help avoid ip_local_port_range exhaustion. From Eric Dumazet. 22) Add Cavium ThunderX driver, from Sunil Goutham. 23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata, from Alexei Starovoitov. 24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai. 25) Double TCP Small Queues default to 256K to accomodate situations like the XEN driver and wireless aggregation. From Wei Liu. 26) Add more entropy inputs to flow dissector, from Tom Herbert. 27) Add CDG congestion control algorithm to TCP, from Kenneth Klette Jonassen. 28) Convert ipset over to RCU locking, from Jozsef Kadlecsik. 29) Track and act upon link status of ipv4 route nexthops, from Andy Gospodarek. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits) bridge: vlan: flush the dynamically learned entries on port vlan delete bridge: multicast: add a comment to br_port_state_selection about blocking state net: inet_diag: export IPV6_V6ONLY sockopt stmmac: troubleshoot unexpected bits in des0 & des1 net: ipv4 sysctl option to ignore routes when nexthop link is down net: track link-status of ipv4 nexthops net: switchdev: ignore unsupported bridge flags net: Cavium: Fix MAC address setting in shutdown state drivers: net: xgene: fix for ACPI support without ACPI ip: report the original address of ICMP messages net/mlx5e: Prefetch skb data on RX net/mlx5e: Pop cq outside mlx5e_get_cqe net/mlx5e: Remove mlx5e_cq.sqrq back-pointer net/mlx5e: Remove extra spaces net/mlx5e: Avoid TX CQE generation if more xmit packets expected net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq() net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device ...
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-06-241-0/+13
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/mellanox/mlx4/main.c net/packet/af_packet.c Both conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| | * neigh: do not modify unlinked entriesJulian Anastasov2015-06-211-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The lockless lookups can return entry that is unlinked. Sometimes they get reference before last neigh_cleanup_and_release, sometimes they do not need reference. Later, any modification attempts may result in the following problems: 1. entry is not destroyed immediately because neigh_update can start the timer for dead entry, eg. on change to NUD_REACHABLE state. As result, entry lives for some time but is invisible and out of control. 2. __neigh_event_send can run in parallel with neigh_destroy while refcnt=0 but if timer is started and expired refcnt can reach 0 for second time leading to second neigh_destroy and possible crash. Thanks to Eric Dumazet and Ying Xue for their work and analyze on the __neigh_event_send change. Fixes: 767e97e1e0db ("neigh: RCU conversion of struct neighbour") Fixes: a263b3093641 ("ipv4: Make neigh lookups directly in output packet path.") Fixes: 6fd6ce2056de ("ipv6: Do not depend on rt->n in ip6_finish_output2().") Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | switchdev; add VLAN support for port's bridge_getlinkScott Feldman2015-06-231-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One more missing piece of the puzzle. Add vlan dump support to switchdev port's bridge_getlink. iproute2 "bridge vlan show" cmd already knows how to show the vlans installed on the bridge and the device , but (until now) no one implemented the port vlan part of the netlink PF_BRIDGE:RTM_GETLINK msg. Before this patch, "bridge vlan show": $ bridge -c vlan show port vlan ids sw1p1 30-34 << bridge side vlans 57 sw1p1 << device side vlans (missing) sw1p2 57 sw1p2 sw1p3 sw1p4 br0 None (When the port is bridged, the output repeats the vlan list for the vlans on the bridge side of the port and the vlans on the device side of the port. The listing above show no vlans for the device side even though they are installed). After this patch: $ bridge -c vlan show port vlan ids sw1p1 30-34 << bridge side vlan 57 sw1p1 30-34 << device side vlans 57 3840 PVID sw1p2 57 sw1p2 57 3840 PVID sw1p3 3842 PVID sw1p4 3843 PVID br0 None I re-used ndo_dflt_bridge_getlink to add vlan fill call-back func. switchdev support adds an obj dump for VLAN objects, using the same call-back scheme as FDB dump. Support included for both compressed and un-compressed vlan dumps. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | bpf: disallow bpf tc programs access current->pid,uidAlexei Starovoitov2015-06-161-6/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | Accessing current->pid/uid from cls_bpf may lead to misleading results and should not be used when TC classifiers need accurate information about pid/uid. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sock_diag: define destruction multicast groupsCraig Gallek2015-06-162-1/+95
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These groups will contain socket-destruction events for AF_INET/AF_INET6, IPPROTO_TCP/IPPROTO_UDP. Near the end of socket destruction, a check for listeners is performed. In the presence of a listener, rather than completely cleanup the socket, a unit of work will be added to a private work queue which will first broadcast information about the socket and then finish the cleanup operation. Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/core: Add reading VF statistics through the PF netdeviceEran Ben Elisha2015-06-161-2/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add ndo_get_vf_stats where the PF retrieves and fills the VFs traffic statistics. We encode the VF stats in a nested manner to allow for future extensions. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | bpf: allow networking programs to use bpf_trace_printk() for debuggingAlexei Starovoitov2015-06-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bpf_trace_printk() is a helper function used to debug eBPF programs. Let socket and TC programs use it as well. Note, it's DEBUG ONLY helper. If it's used in the program, the kernel will print warning banner to make sure users don't use it in production. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | bpf: introduce current->pid, tgid, uid, gid, comm accessorsAlexei Starovoitov2015-06-161-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | eBPF programs attached to kprobes need to filter based on current->pid, uid and other fields, so introduce helper functions: u64 bpf_get_current_pid_tgid(void) Return: current->tgid << 32 | current->pid u64 bpf_get_current_uid_gid(void) Return: current_gid << 32 | current_uid bpf_get_current_comm(char *buf, int size_of_buf) stores current->comm into buf They can be used from the programs attached to TC as well to classify packets based on current task fields. Update tracex2 example to print histogram of write syscalls for each process instead of aggregated for all. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-06-142-10/+7Star
| |\|
| * | flow_dissector: fix ipv6 dst, hop-by-hop and routing ext hdrsEric Dumazet2015-06-131-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __skb_header_pointer() returns a pointer that must be checked. Fixes infinite loop reported by Alexei, and add __must_check to catch these errors earlier. Fixes: 6a74fcf426f5 ("flow_dissector: add support for dst, hop-by-hop and routing ext hdrs") Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | flow_dissector: add support for dst, hop-by-hop and routing ext hdrsTom Herbert2015-06-121-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | If dst, hop-by-hop or routing extension headers are present determine length of the options and skip over them in flow dissection. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | flow_dissector: Fix MPLS entropy label handling in flow dissectorTom Herbert2015-06-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Need to shift after masking to get label value for comparison. Fixes: b3baa0fbd02a1a9d493d8 ("mpls: Add MPLS entropy label in flow_keys") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/ethtool: Add current supported tunable optionsHadar Hen Zion2015-06-111-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | Add strings array of the current supported tunable options. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-06-091-9/+2Star
| |\ \
| * | | bpf: allow programs to write to certain skb fieldsAlexei Starovoitov2015-06-071-12/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | allow programs read/write skb->mark, tc_index fields and ((struct qdisc_skb_cb *)cb)->data. mark and tc_index are generically useful in TC. cb[0]-cb[4] are primarily used to pass arguments from one program to another called via bpf_tail_call() which can be seen in sockex3_kern.c example. All fields of 'struct __sk_buff' are readable to socket and tc_cls_act progs. mark, tc_index are writeable from tc_cls_act only. cb[0]-cb[4] are writeable by both sockets and tc_cls_act. Add verifier tests and improve sample code. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | bpf: make programs see skb->data == L2 for ingress and egressAlexei Starovoitov2015-06-071-23/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data. For ingress L2 header is already pulled, whereas for egress it's present. This is known to program writers which are currently forced to use BPF_LL_OFF workaround. Since programs don't change skb internal pointers it is safe to do pull/push right around invocation of the program and earlier taps and later pt->func() will not be affected. Multiple taps via packet_rcv(), tpacket_rcv() are doing the same trick around run_filter/BPF_PROG_RUN even if skb_shared. This fix finally allows programs to use optimized LD_ABS/IND instructions without BPF_LL_OFF for higher performance. tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o w/o JIT w/JIT before 20.5 23.6 Mpps after 21.8 26.6 Mpps Old programs with BPF_LL_OFF will still work as-is. We can now undo most of the earlier workaround commit: a166151cbe33 ("bpf: fix bpf helpers to use skb->mac_header relative offsets") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | mpls: Add MPLS entropy label in flow_keysTom Herbert2015-06-051-0/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In flow dissector if an MPLS header contains an entropy label this is saved in the new keyid field of flow_keys. The entropy label is then represented in the flow hash function input. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: Add GRE keyid in flow_keysTom Herbert2015-06-051-1/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In flow dissector if a GRE header contains a keyid this is saved in the new keyid field of flow_keys. The GRE keyid is then represented in the flow hash function input. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: Add IPv6 flow label to flow_keysTom Herbert2015-06-051-19/+10Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In flow_dissector set the flow label in flow_keys for IPv6. This also removes the shortcircuiting of flow dissection when a non-zero label is present, the flow label can be considered to provide additional entropy for a hash. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>