summaryrefslogtreecommitdiffstats
path: root/net/ipv6
Commit message (Collapse)AuthorAgeFilesLines
* ip6_tunnel: Reduce log level in ip6_tnl_err() to debugMatt Bennett2015-09-251-8/+8
| | | | | | | | | | | | | | | Currently error log messages in ip6_tnl_err are printed at 'warn' level. This is different to other tunnel types which don't print any messages. These log messages don't provide any information that couldn't be deduced with networking tools. Also it can be annoying to have one end of the tunnel go down and have the logs fill with pointless messages such as "Path to destination invalid or inactive!". This patch reduces the log level of these messages to 'dbg' level to bring the visible behaviour into line with other tunnel types. Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip6_gre: Reduce log level in ip6gre_err() to debugMatt Bennett2015-09-251-8/+8
| | | | | | | | | | | | | | | Currently error log messages in ip6gre_err are printed at 'warn' level. This is different to most other tunnel types which don't print any messages. These log messages don't provide any information that couldn't be deduced with networking tools. Also it can be annoying to have one end of the tunnel go down and have the logs fill with pointless messages such as "Path to destination invalid or inactive!". This patch reduces the log level of these messages to 'dbg' level to bring the visible behaviour into line with other tunnel types. Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Fix behaviour of unreachable, blackhole and prohibit routesNikola Forró2015-09-211-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Man page of ip-route(8) says following about route types: unreachable - these destinations are unreachable. Packets are dis‐ carded and the ICMP message host unreachable is generated. The local senders get an EHOSTUNREACH error. blackhole - these destinations are unreachable. Packets are dis‐ carded silently. The local senders get an EINVAL error. prohibit - these destinations are unreachable. Packets are discarded and the ICMP message communication administratively prohibited is generated. The local senders get an EACCES error. In the inet6 address family, this was correct, except the local senders got ENETUNREACH error instead of EHOSTUNREACH in case of unreachable route. In the inet address family, all three route types generated ICMP message net unreachable, and the local senders got ENETUNREACH error. In both address families all three route types now behave consistently with documentation. Signed-off-by: Nikola Forró <nforro@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: ip6_fragment: fix headroom tests and skb leakFlorian Westphal2015-09-181-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | David Woodhouse reports skb_under_panic when we try to push ethernet header to fragmented ipv6 skbs: skbuff: skb_under_panic: text:c1277f1e len:1294 put:14 head:dec98000 data:dec97ffc tail:0xdec9850a end:0xdec98f40 dev:br-lan [..] ip6_finish_output2+0x196/0x4da David further debugged this: [..] offending fragments were arriving here with skb_headroom(skb)==10. Which is reasonable, being the Solos ADSL card's header of 8 bytes followed by 2 bytes of PPP frame type. The problem is that if netfilter ipv6 defragmentation is used, skb_cow() in ip6_forward will only see reassembled skb. Therefore, headroom is overestimated by 8 bytes (we pulled fragment header) and we don't check the skbs in the frag_list either. We can't do these checks in netfilter defrag since outdev isn't known yet. Furthermore, existing tests in ip6_fragment did not consider the fragment or ipv6 header size when checking headroom of the fraglist skbs. While at it, also fix a skb leak on memory allocation -- ip6_fragment must consume the skb. I tested this e1000 driver hacked to not allocate additional headroom (we end up in slowpath, since LL_RESERVED_SPACE is 16). If 2 bytes of headroom are allocated, fastpath is taken (14 byte ethernet header was pulled, so 16 byte headroom available in all fragments). Reported-by: David Woodhouse <dwmw2@infradead.org> Diagnosed-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Florian Westphal <fw@strlen.de> Tested-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: include NLM_F_REPLACE in route replace notificationsRoopa Prabhu2015-09-182-5/+6
| | | | | | | | | | | This patch adds NLM_F_REPLACE flag to ipv6 route replace notifications. This makes nlm_flags in ipv6 replace notifications consistent with ipv4. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Replace spinlock with seqlock and rcu in ip6_tunnelMartin KaFai Lau2015-09-152-26/+34
| | | | | | | | | This patch uses a seqlock to ensure consistency between idst->dst and idst->cookie. It also makes dst freeing from fib tree to undergo a rcu grace period. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Avoid double dst_freeMartin KaFai Lau2015-09-153-9/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | It is a prep work to get dst freeing from fib tree undergo a rcu grace period. The following is a common paradigm: if (ip6_del_rt(rt)) dst_free(rt) which means, if rt cannot be deleted from the fib tree, dst_free(rt) now. 1. We don't know the ip6_del_rt(rt) failure is because it was not managed by fib tree (e.g. DST_NOCACHE) or it had already been removed from the fib tree. 2. If rt had been managed by the fib tree, ip6_del_rt(rt) failure means dst_free(rt) has been called already. A second dst_free(rt) is not always obviously safe. The rt may have been destroyed already. 3. If rt is a DST_NOCACHE, dst_free(rt) should not be called. 4. It is a stopper to make dst freeing from fib tree undergo a rcu grace period. This patch is to use a DST_NOCACHE flag to indicate a rt is not managed by the fib tree. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Fix dst_entry refcnt bugs in ip6_tunnelMartin KaFai Lau2015-09-152-46/+114
| | | | | | | | | | | | | | | | | | | | | | | | | Problems in the current dst_entry cache in the ip6_tunnel: 1. ip6_tnl_dst_set is racy. There is no lock to protect it: - One major problem is that the dst refcnt gets messed up. F.e. the same dst_cache can be released multiple times and then triggering the infamous dst refcnt < 0 warning message. - Another issue is the inconsistency between dst_cache and dst_cookie. It can be reproduced by adding and removing the ip6gre tunnel while running a super_netperf TCP_CRR test. 2. ip6_tnl_dst_get does not take the dst refcnt before returning the dst. This patch: 1. Create a percpu dst_entry cache in ip6_tnl 2. Use a spinlock to protect the dst_cache operations 3. ip6_tnl_dst_get always takes the dst refcnt before returning Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Rename the dst_cache helper functions in ip6_tunnelMartin KaFai Lau2015-09-152-8/+8
| | | | | | | | | | | | | | It is a prep work to fix the dst_entry refcnt bugs in ip6_tunnel. This patch rename: 1. ip6_tnl_dst_check() to ip6_tnl_dst_get() to better reflect that it will take a dst refcnt in the next patch. 2. ip6_tnl_dst_store() to ip6_tnl_dst_set() to have a more conventional name matching with ip6_tnl_dst_get(). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Refactor common ip6gre_tunnel_init codesMartin KaFai Lau2015-09-151-13/+24
| | | | | | | | | | It is a prep work to fix the dst_entry refcnt bugs in ip6_tunnel. This patch refactors some common init codes used by both ip6gre_tunnel_init and ip6gre_tap_init. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2015-09-103-34/+175
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Fix out-of-bounds array access in netfilter ipset, from Jozsef Kadlecsik. 2) Use correct free operation on netfilter conntrack templates, from Daniel Borkmann. 3) Fix route leak in SCTP, from Marcelo Ricardo Leitner. 4) Fix sizeof(pointer) in mac80211, from Thierry Reding. 5) Fix cache pointer comparison in ip6mr leading to missed unlock of mrt_lock. From Richard Laing. 6) rds_conn_lookup() needs to consider network namespace in key comparison, from Sowmini Varadhan. 7) Fix deadlock in TIPC code wrt broadcast link wakeups, from Kolmakov Dmitriy. 8) Fix fd leaks in bpf syscall, from Daniel Borkmann. 9) Fix error recovery when installing ipv6 multipath routes, we would delete the old route before we would know if we could fully commit to the new set of nexthops. Fix from Roopa Prabhu. 10) Fix run-time suspend problems in r8152, from Hayes Wang. 11) In fec, don't program the MAC address into the chip when the clocks are gated off. From Fugang Duan. 12) Fix poll behavior for netlink sockets when using rx ring mmap, from Daniel Borkmann. 13) Don't allocate memory with GFP_KERNEL from get_stats64 in r8169 driver, from Corinna Vinschen. 14) In TCP Cubic congestion control, handle idle periods better where we are application limited, in order to keep cwnd from growing out of control. From Eric Dumzet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits) tcp_cubic: better follow cubic curve after idle period tcp: generate CA_EVENT_TX_START on data frames xen-netfront: respect user provided max_queues xen-netback: respect user provided max_queues r8169: Fix sleeping function called during get_stats64, v2 ether: add IEEE 1722 ethertype - TSN netlink, mmap: fix edge-case leakages in nf queue zero-copy netlink, mmap: don't walk rx ring on poll if receive queue non-empty cxgb4: changes for new firmware 1.14.4.0 net: fec: add netif status check before set mac address r8152: fix the runtime suspend issues r8152: split DRIVER_VERSION ipv6: fix ifnullfree.cocci warnings add microchip LAN88xx phy driver stmmac: fix check for phydev being open net: qlcnic: delete redundant memsets net: mv643xx_eth: use kzalloc net: jme: use kzalloc() instead of kmalloc+memset net: cavium: liquidio: use kzalloc in setup_glist() net: ipv6: use common fib_default_rule_pref ...
| * ipv6: fix ifnullfree.cocci warningsWu Fengguang2015-09-101-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | net/ipv6/route.c:2946:3-8: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values. NULL check before some freeing functions is not needed. Based on checkpatch warning "kfree(NULL) is safe this check is probably not required" and kfreeaddr.cocci by Julia Lawall. Generated by: scripts/coccinelle/free/ifnullfree.cocci CC: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: ipv6: use common fib_default_rule_prefPhil Sutter2015-09-092-7/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This switches IPv6 policy routing to use the shared fib_default_rule_pref() function of IPv4 and DECnet. It is also used in multicast routing for IPv4 as well as IPv6. The motivation for this patch is a complaint about iproute2 behaving inconsistent between IPv4 and IPv6 when adding policy rules: Formerly, IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the assigned priority value was decreased with each rule added. Since then all users of the default_pref field have been converted to assign the generic function fib_default_rule_pref(), fib_nl_newrule() may just use it directly instead. Therefore get rid of the function pointer altogether and make fib_default_rule_pref() static, as it's not used outside fib_rules.c anymore. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: fix multipath route replace error recoveryRoopa Prabhu2015-09-091-26/+175
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: The ecmp route replace support for ipv6 in the kernel, deletes the existing ecmp route too early, ie when it installs the first nexthop. If there is an error in installing the subsequent nexthops, its too late to recover the already deleted existing route leaving the fib in an inconsistent state. This patch reduces the possibility of this by doing the following: a) Changes the existing multipath route add code to a two stage process: build rt6_infos + insert them ip6_route_add rt6_info creation code is moved into ip6_route_info_create. b) This ensures that most errors are caught during building rt6_infos and we fail early c) Separates multipath add and del code. Because add needs the special two stage mode in a) and delete essentially does not care. d) In any event if the code fails during inserting a route again, a warning is printed (This should be unlikely) Before the patch: $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 /* Try replacing the route with a duplicate nexthop */ $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1 RTNETLINK answers: File exists $ip -6 route show /* previously added ecmp route 3000:1000:1000:1000::2 dissappears from * kernel */ After the patch: $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 /* Try replacing the route with a duplicate nexthop */ $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1 RTNETLINK answers: File exists $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 Fixes: 27596472473a ("ipv6: fix ECMP route replacement") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv6: Correct PIM6 mrt_lock handlingRichard Laing2015-09-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | In the IPv6 multicast routing code the mrt_lock was not being released correctly in the MFC iterator, as a result adding or deleting a MIF would cause a hang because the mrt_lock could not be acquired. This fix is a copy of the code for the IPv4 case and ensures that the lock is released correctly. Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge tag 'for-linus' of ↵Linus Torvalds2015-09-091-31/+0Star
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull inifiniband/rdma updates from Doug Ledford: "This is a fairly sizeable set of changes. I've put them through a decent amount of testing prior to sending the pull request due to that. There are still a few fixups that I know are coming, but I wanted to go ahead and get the big, sizable chunk into your hands sooner rather than waiting for those last few fixups. Of note is the fact that this creates what is intended to be a temporary area in the drivers/staging tree specifically for some cleanups and additions that are coming for the RDMA stack. We deprecated two drivers (ipath and amso1100) and are waiting to hear back if we can deprecate another one (ehca). We also put Intel's new hfi1 driver into this area because it needs to be refactored and a transfer library created out of the factored out code, and then it and the qib driver and the soft-roce driver should all be modified to use that library. I expect drivers/staging/rdma to be around for three or four kernel releases and then to go away as all of the work is completed and final deletions of deprecated drivers are done. Summary of changes for 4.3: - Create drivers/staging/rdma - Move amso1100 driver to staging/rdma and schedule for deletion - Move ipath driver to staging/rdma and schedule for deletion - Add hfi1 driver to staging/rdma and set TODO for move to regular tree - Initial support for namespaces to be used on RDMA devices - Add RoCE GID table handling to the RDMA core caching code - Infrastructure to support handling of devices with differing read and write scatter gather capabilities - Various iSER updates - Kill off unsafe usage of global mr registrations - Update SRP driver - Misc mlx4 driver updates - Support for the mr_alloc verb - Support for a netlink interface between kernel and user space cache daemon to speed path record queries and route resolution - Ininitial support for safe hot removal of verbs devices" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits) IB/ipoib: Suppress warning for send only join failures IB/ipoib: Clean up send-only multicast joins IB/srp: Fix possible protection fault IB/core: Move SM class defines from ib_mad.h to ib_smi.h IB/core: Remove unnecessary defines from ib_mad.h IB/hfi1: Add PSM2 user space header to header_install IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY mlx5: Fix incorrect wc pkey_index assignment for GSI messages IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow IB/uverbs: reject invalid or unknown opcodes IB/cxgb4: Fix if statement in pick_local_ip6adddrs IB/sa: Fix rdma netlink message flags IB/ucma: HW Device hot-removal support IB/mlx4_ib: Disassociate support IB/uverbs: Enable device removal when there are active user space applications IB/uverbs: Explicitly pass ib_dev to uverbs commands IB/uverbs: Fix race between ib_uverbs_open and remove_one IB/uverbs: Fix reference counting usage of event files IB/core: Make ib_dealloc_pd return void IB/srp: Create an insecure all physical rkey only if needed ...
| * net/ipv6: Export addrconf_ifid_eui48Matan Barak2015-08-311-31/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | For loopback purposes, RoCE devices should have a default GID in the port GID table, even when the interface is down. In order to do so, we use the IPv6 link local address which would have been genenrated for the related Ethernet netdevice when it goes up as a default GID. addrconf_ifid_eui48 is used to gernerate this address, export it. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
* | netfilter: nf_dup{4, 6}: fix build error when nf_conntrack disabledDaniel Borkmann2015-09-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While testing various Kconfig options on another issue, I found that the following one triggers as well on allmodconfig and nf_conntrack disabled: net/ipv4/netfilter/nf_dup_ipv4.c: In function ‘nf_dup_ipv4’: net/ipv4/netfilter/nf_dup_ipv4.c:72:20: error: ‘nf_skb_duplicated’ undeclared (first use in this function) if (this_cpu_read(nf_skb_duplicated)) [...] net/ipv6/netfilter/nf_dup_ipv6.c: In function ‘nf_dup_ipv6’: net/ipv6/netfilter/nf_dup_ipv6.c:66:20: error: ‘nf_skb_duplicated’ undeclared (first use in this function) if (this_cpu_read(nf_skb_duplicated)) Fix it by including directly the header where it is defined. Fixes: bbde9fc1824a ("netfilter: factor out packet duplication for IPv4/IPv6") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: fix exthdrs offload registration in out_rt pathDaniel Borkmann2015-09-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | We previously register IPPROTO_ROUTING offload under inet6_add_offload(), but in error path, we try to unregister it with inet_del_offload(). This doesn't seem correct, it should actually be inet6_del_offload(), also ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it also uses rthdr_offload twice), but it got removed entirely later on. Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: send only one NEWLINK when RA causes changesMarius Tomaschewski2015-09-011-3/+10
| | | | | | | | | | Signed-off-by: Marius Tomaschewski <mt@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: send NEWLINK on RA managed/otherconf changesMarius Tomaschewski2015-09-011-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | The kernel is applying the RA managed/otherconf flags silently and forgets to send ifinfo notify to inform about their change when the router provides a zero reachable_time and retrans_timer as dnsmasq and many routers send it, which just means unspecified by this router and the host should continue using whatever value it is already using. Userspace may monitor the ifinfo notifications to activate dhcpv6. Signed-off-by: Marius Tomaschewski <mt@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: use dctcp if enabled on the route to the initiatorDaniel Borkmann2015-08-311-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the following case doesn't use DCTCP, even if it should: A responder has f.e. Cubic as system wide default, but for a specific route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The initiating host then uses DCTCP as congestion control, but since the initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok, and we have to fall back to Reno after 3WHS completes. We were thinking on how to solve this in a minimal, non-intrusive way without bloating tcp_ecn_create_request() needlessly: lets cache the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0) is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows to only do a single metric feature lookup inside tcp_ecn_create_request(). Joint work with Florian Westphal. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | fib, fib6: reject invalid feature bitsDaniel Borkmann2015-08-311-0/+2
| | | | | | | | | | | | | | | | | | Feature bits that are invalid should not be accepted by the kernel, only the lower 4 bits may be configured, but not the remaining ones. Even from these 4, 2 of them are unused. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: fib6: reduce identation in ip6_convert_metricsDaniel Borkmann2015-08-311-16/+16
| | | | | | | | | | | | | | | | Reduce the identation a bit, there's no need to artificically have it increased. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Optimize snmp stat aggregation by walking all the percpu data at onceRaghavendra K T2015-08-311-10/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field. reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data of an item (iteratively for around 36 items). idea: This patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Docker creation got faster by more than 2x after the patch. Result: Before After Docker creation time 6.836s 3.25s cache miss 2.7% 1.41% perf before: 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock perf after: 10.57% docker docker [.] scanblock 8.37% swapper [kernel.kallsyms] [k] snooze_loop 6.91% docker [kernel.kallsyms] [k] snmp_get_cpu_field 6.67% docker [kernel.kallsyms] [k] veth_stats_one changes/ideas suggested: Using buffer in stack (Eric), Usage of memset (David), Using memcpy in place of unaligned_put (Joe). Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | vxlan: do not receive IPv4 packets on IPv6 socketJiri Benc2015-08-291-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By default (subject to the sysctl settings), IPv6 sockets listen also for IPv4 traffic. Vxlan is not prepared for that and expects IPv6 header in packets received through an IPv6 socket. In addition, it's currently not possible to have both IPv4 and IPv6 vxlan tunnel on the same port (unless bindv6only sysctl is enabled), as it's not possible to create and bind both IPv4 and IPv6 vxlan interfaces and there's no way to specify both IPv4 and IPv6 remote/group IP addresses. Set IPV6_V6ONLY on vxlan sockets to fix both of these issues. This is not done globally in udp_tunnel, as l2tp and tipc seems to work okay when receiving IPv4 packets on IPv6 socket and people may rely on this behavior. The other tunnels (geneve and fou) do not support IPv6. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip_tunnels: convert the mode field of ip_tunnel_info to flagsJiri Benc2015-08-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | The mode field holds a single bit of information only (whether the ip_tunnel_info struct is for rx or tx). Change the mode field to bit flags. This allows more mode flags to be added. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2015-08-294-18/+17Star
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree. In sum, patches to address fallout from the previous round plus updates from the IPVS folks via Simon Horman, they are: 1) Add a new scheduler to IPVS: The weighted overflow scheduling algorithm directs network connections to the server with the highest weight that is currently available and overflows to the next when active connections exceed the node's weight. From Raducu Deaconu. 2) Fix locking ordering in IPVS, always take rtnl_lock in first place. Patch from Julian Anastasov. 3) Allow to indicate the MTU to the IPVS in-kernel state sync daemon. From Julian Anastasov. 4) Enhance multicast configuration for the IPVS state sync daemon. Also from Julian. 5) Resolve sparse warnings in the nf_dup modules. 6) Fix a linking problem when CONFIG_NF_DUP_IPV6 is not set. 7) Add ICMP codes 5 and 6 to IPv6 REJECT target, they are more informative subsets of code 1. From Andreas Herz. 8) Revert the jumpstack size calculation from mark_source_chains due to chain depth miscalculations, from Florian Westphal. 9) Calm down more sparse warning around the Netfilter tree, again from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netfilter: reduce sparse warningsFlorian Westphal2015-08-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bridge/netfilter/ebtables.c:290:26: warning: incorrect type in assignment (different modifiers) -> remove __pure annotation. ipv6/netfilter/ip6t_SYNPROXY.c:240:27: warning: cast from restricted __be16 -> switch ntohs to htons and vice versa. netfilter/core.c:391:30: warning: symbol 'nfq_ct_nat_hook' was not declared. Should it be static? -> delete it, got removed net/netfilter/nf_synproxy_core.c:221:48: warning: cast to restricted __be32 -> Use __be32 instead of u32. Tested with objdiff that these changes do not affect generated code. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | Revert "netfilter: xtables: compute exact size needed for jumpstack"Florian Westphal2015-08-281-15/+8Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 98d1bd802cdbc8f56868fae51edec13e86b59515. mark_source_chains will not re-visit chains, so *filter :INPUT ACCEPT [365:25776] :FORWARD ACCEPT [0:0] :OUTPUT ACCEPT [217:45832] :t1 - [0:0] :t2 - [0:0] :t3 - [0:0] :t4 - [0:0] -A t1 -i lo -j t2 -A t2 -i lo -j t3 -A t3 -i lo -j t4 # -A INPUT -j t4 # -A INPUT -j t3 # -A INPUT -j t2 -A INPUT -j t1 COMMIT Will compute a chain depth of 2 if the comments are removed. Revert back to counting the number of chains for the time being. Reported-by: Cong Wang <cwang@twopensource.com> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ip6t_REJECT: added missing icmpv6 codesAndreas Herz2015-08-261-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RFC 4443 added two new codes values for ICMPv6 type 1: 5 - Source address failed ingress/egress policy 6 - Reject route to destination And RFC 7084 states in L-14 that IPv6 Router MUST send ICMPv6 Destination Unreachable with code 5 for packets forwarded to it that use an address from a prefix that has been invalidated. Codes 5 and 6 are more informative subsets of code 1. Signed-off-by: Andreas Herz <andi@geekosphere.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_dup: fix sparse warningsPablo Neira Ayuso2015-08-211-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | >> net/ipv4/netfilter/nft_dup_ipv4.c:29:37: sparse: incorrect type in initializer (different base types) net/ipv4/netfilter/nft_dup_ipv4.c:29:37: expected restricted __be32 [user type] s_addr net/ipv4/netfilter/nft_dup_ipv4.c:29:37: got unsigned int [unsigned] <noident> >> net/ipv6/netfilter/nf_dup_ipv6.c:48:23: sparse: incorrect type in assignment (different base types) net/ipv6/netfilter/nf_dup_ipv6.c:48:23: expected restricted __be32 [addressable] [assigned] [usertype] flowlabel net/ipv6/netfilter/nf_dup_ipv6.c:48:23: got int Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-08-281-0/+1
|\ \ \
| * | | ip6_gre: release cached dst on tunnel removalhuaibin Wang2015-08-251-0/+1
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a tunnel is deleted, the cached dst entry should be released. This problem may prevent the removal of a netns (seen with a x-netns IPv6 gre tunnel): unregister_netdevice: waiting for lo to become free. Usage count = 3 CC: Dmitry Kozlov <xeb@mail.ru> Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: huaibin Wang <huaibin.wang@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv6: Export nf_ct_frag6_gather()Joe Stringer2015-08-271-0/+1
| | | | | | | | | | | | | | | | | | | | | Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ah6: fix error return codeJulia Lawall2015-08-251-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Return a negative error code on failure. A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ identifier ret; expression e1,e2; @@ ( if (\(ret < 0\|ret != 0\)) { ... return ret; } | ret = 0 ) ... when != ret = e1 when != &ret *if(...) { ... when != ret = e2 when forall return ret; } // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | xfrm: Use VRF master index if output device is enslavedDavid Ahern2015-08-251-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Directs route lookups to VRF table. Compiles out if NET_VRF is not enabled. With this patch able to successfully bring up ipsec tunnels in VRFs, even with duplicate network configuration. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ila: Precompute checksum difference for translationsTom Herbert2015-08-241-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the ILA build state for LWT compute the checksum difference to apply to transport checksums that include the IPv6 pseudo header. The difference is between the route destination (from fib6_config) and the locator to write. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | lwt: Add cfg argument to build_stateTom Herbert2015-08-242-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add cfg and family arguments to lwt build state functions. cfg is a void pointer and will either be a pointer to a fib_config or fib6_config structure. The family parameter indicates which one (either AF_INET or AF_INET6). LWT encpasulation implementation may use the fib configuration to build the LWT state. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-08-213-39/+75
|\| | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/usb/qmi_wwan.c Overlapping additions of new device IDs to qmi_wwan.c Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv6: Fix a potential deadlock when creating pcpu rtMartin KaFai Lau2015-08-172-11/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rt6_make_pcpu_route() is called under read_lock(&table->tb6_lock). rt6_make_pcpu_route() calls ip6_rt_pcpu_alloc(rt) which then calls dst_alloc(). dst_alloc() _may_ call ip6_dst_gc() which takes the write_lock(&tabl->tb6_lock). A visualized version: read_lock(&table->tb6_lock); rt6_make_pcpu_route(); => ip6_rt_pcpu_alloc(); => dst_alloc(); => ip6_dst_gc(); => write_lock(&table->tb6_lock); /* oops */ The fix is to do a read_unlock first before calling ip6_rt_pcpu_alloc(). A reported stack: [141625.537638] INFO: rcu_sched self-detected stall on CPU { 27} (t=60000 jiffies g=4159086 c=4159085 q=2139) [141625.547469] Task dump for CPU 27: [141625.550881] mtr R running task 0 22121 22081 0x00000008 [141625.558069] 0000000000000000 ffff88103f363d98 ffffffff8106e488 000000000000001b [141625.565641] ffffffff81684900 ffff88103f363db8 ffffffff810702b0 0000000008000000 [141625.573220] ffffffff81684900 ffff88103f363de8 ffffffff8108df9f ffff88103f375a00 [141625.580803] Call Trace: [141625.583345] <IRQ> [<ffffffff8106e488>] sched_show_task+0xc1/0xc6 [141625.589650] [<ffffffff810702b0>] dump_cpu_task+0x35/0x39 [141625.595144] [<ffffffff8108df9f>] rcu_dump_cpu_stacks+0x6a/0x8c [141625.601320] [<ffffffff81090606>] rcu_check_callbacks+0x1f6/0x5d4 [141625.607669] [<ffffffff810940c8>] update_process_times+0x2a/0x4f [141625.613925] [<ffffffff8109fbee>] tick_sched_handle+0x32/0x3e [141625.619923] [<ffffffff8109fc2f>] tick_sched_timer+0x35/0x5c [141625.625830] [<ffffffff81094a1f>] __hrtimer_run_queues+0x8f/0x18d [141625.632171] [<ffffffff81094c9e>] hrtimer_interrupt+0xa0/0x166 [141625.638258] [<ffffffff8102bf2a>] local_apic_timer_interrupt+0x4e/0x52 [141625.645036] [<ffffffff8102c36f>] smp_apic_timer_interrupt+0x39/0x4a [141625.651643] [<ffffffff8140b9e8>] apic_timer_interrupt+0x68/0x70 [141625.657895] <EOI> [<ffffffff81346ee8>] ? dst_destroy+0x7c/0xb5 [141625.664188] [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20 [141625.670272] [<ffffffff81082b45>] ? queue_write_lock_slowpath+0x60/0x6f [141625.677140] [<ffffffff8140aa33>] _raw_write_lock_bh+0x23/0x25 [141625.683218] [<ffffffff813d4553>] __fib6_clean_all+0x40/0x82 [141625.689124] [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20 [141625.695207] [<ffffffff813d6058>] fib6_clean_all+0xe/0x10 [141625.700854] [<ffffffff813d60d3>] fib6_run_gc+0x79/0xc8 [141625.706329] [<ffffffff813d0510>] ip6_dst_gc+0x85/0xf9 [141625.711718] [<ffffffff81346d68>] dst_alloc+0x55/0x159 [141625.717105] [<ffffffff813d09b5>] __ip6_dst_alloc.isra.32+0x19/0x63 [141625.723620] [<ffffffff813d1830>] ip6_pol_route+0x36a/0x3e8 [141625.729441] [<ffffffff813d18d6>] ip6_pol_route_output+0x11/0x13 [141625.735700] [<ffffffff813f02c8>] fib6_rule_action+0xa7/0x1bf [141625.741698] [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17 [141625.748043] [<ffffffff81357c48>] fib_rules_lookup+0xb5/0x12a [141625.754050] [<ffffffff81141628>] ? poll_select_copy_remaining+0xf9/0xf9 [141625.761002] [<ffffffff813f0535>] fib6_rule_lookup+0x37/0x5c [141625.766914] [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17 [141625.773260] [<ffffffff813d008c>] ip6_route_output+0x7a/0x82 [141625.779177] [<ffffffff813c44c8>] ip6_dst_lookup_tail+0x53/0x112 [141625.785437] [<ffffffff813c45c3>] ip6_dst_lookup_flow+0x2a/0x6b [141625.791604] [<ffffffff813ddaab>] rawv6_sendmsg+0x407/0x9b6 [141625.797423] [<ffffffff813d7914>] ? do_ipv6_setsockopt.isra.8+0xd87/0xde2 [141625.804464] [<ffffffff8139d4b4>] inet_sendmsg+0x57/0x8e [141625.810028] [<ffffffff81329ba3>] sock_sendmsg+0x2e/0x3c [141625.815588] [<ffffffff8132be57>] SyS_sendto+0xfe/0x143 [141625.821063] [<ffffffff813dd551>] ? rawv6_setsockopt+0x5e/0x67 [141625.827146] [<ffffffff8132c9f8>] ? sock_common_setsockopt+0xf/0x11 [141625.833660] [<ffffffff8132c08c>] ? SyS_setsockopt+0x81/0xa2 [141625.839565] [<ffffffff8140ac17>] entry_SYSCALL_64_fastpath+0x12/0x6a Fixes: d52d3997f843 ("pv6: Create percpu rt6_info") Signed-off-by: Martin KaFai Lau <kafai@fb.com> CC: Hannes Frederic Sowa <hannes@stressinduktion.org> Reported-by: Steinar H. Gunderson <sgunderson@bigfoot.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv6: Add rt6_make_pcpu_route()Martin KaFai Lau2015-08-171-4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is a prep work for fixing a potential deadlock when creating a pcpu rt. The current rt6_get_pcpu_route() will also create a pcpu rt if one does not exist. This patch moves the pcpu rt creation logic into another function, rt6_make_pcpu_route(). Signed-off-by: Martin KaFai Lau <kafai@fb.com> CC: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv6: Remove un-used argument from ip6_dst_alloc()Martin KaFai Lau2015-08-171-12/+9Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After 4b32b5ad31a6 ("ipv6: Stop rt6_info from using inet_peer's metrics"), ip6_dst_alloc() does not need the 'table' argument. This patch cleans it up. Signed-off-by: Martin KaFai Lau <kafai@fb.com> CC: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: fix wrong skb_get() usage / crash in IGMP/MLD parsing codeLinus Lüssing2015-08-141-15/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent refactoring of the IGMP and MLD parsing code into ipv6_mc_check_mld() / ip_mc_check_igmp() introduced a potential crash / BUG() invocation for bridges: I wrongly assumed that skb_get() could be used as a simple reference counter for an skb which is not the case. skb_get() bears additional semantics, a user count. This leads to a BUG() invocation in pskb_expand_head() / kernel panic if pskb_may_pull() is called on an skb with a user count greater than one - unfortunately the refactoring did just that. Fixing this by removing the skb_get() call and changing the API: The caller of ipv6_mc_check_mld() / ip_mc_check_igmp() now needs to additionally check whether the returned skb_trimmed is a clone. Fixes: 9afd85c9e455 ("net: Export IGMP/MLD message validation code") Reported-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2015-08-217-11/+233
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter updates for net-next This is second pull request includes the conflict resolution patch that resulted from the updates that we got for the conntrack template through kmalloc. No changes with regards to the previously sent 15 patches. The following patchset contains Netfilter updates for your net-next tree, they are: 1) Rework the existing nf_tables counter expression to make it per-cpu. 2) Prepare and factor out common packet duplication code from the TEE target so it can be reused from the new dup expression. 3) Add the new dup expression for the nf_tables IPv4 and IPv6 families. 4) Convert the nf_tables limit expression to use a token-based approach with 64-bits precision. 5) Enhance the nf_tables limit expression to support limiting at packet byte. This comes after several preparation patches. 6) Add a burst parameter to indicate the amount of packets or bytes that can exceed the limiting. 7) Add netns support to nfacct, from Andreas Schultz. 8) Pass the nf_conn_zone structure instead of the zone ID in nf_tables to allow accessing more zone specific information, from Daniel Borkmann. 9) Allow to define zone per-direction to support netns containers with overlapping network addressing, also from Daniel. 10) Extend the CT target to allow setting the zone based on the skb->mark as a way to support simple mappings from iptables, also from Daniel. 11) Make the nf_tables payload expression aware of the fact that VLAN offload may have removed a vlan header, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge branch 'master' of ↵Pablo Neira Ayuso2015-08-2113-24/+389
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Resolve conflicts with conntrack template fixes. Conflicts: net/netfilter/nf_conntrack_core.c net/netfilter/nf_synproxy_core.c net/netfilter/xt_CT.c Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | netfilter: nf_conntrack: add efficient mark to zone mappingDaniel Borkmann2015-08-181-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work adds the possibility of deriving the zone id from the skb->mark field in a scalable manner. This allows for having only a single template serving hundreds/thousands of different zones, for example, instead of the need to have one match for each zone as an extra CT jump target. Note that we'd need to have this information attached to the template as at the time when we're trying to lookup a possible ct object, we already need to know zone information for a possible match when going into __nf_conntrack_find_get(). This work provides a minimal implementation for a possible mapping. In order to not add/expose an extra ct->status bit, the zone structure has been extended to carry a flag for deriving the mark. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | netfilter: nf_conntrack: add direction support for zonesDaniel Borkmann2015-08-181-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work adds a direction parameter to netfilter zones, so identity separation can be performed only in original/reply or both directions (default). This basically opens up the possibility of doing NAT with conflicting IP address/port tuples from multiple, isolated tenants on a host (e.g. from a netns) without requiring each tenant to NAT twice resp. to use its own dedicated IP address to SNAT to, meaning overlapping tuples can be made unique with the zone identifier in original direction, where the NAT engine will then allocate a unique tuple in the commonly shared default zone for the reply direction. In some restricted, local DNAT cases, also port redirection could be used for making the reply traffic unique w/o requiring SNAT. The consensus we've reached and discussed at NFWS and since the initial implementation [1] was to directly integrate the direction meta data into the existing zones infrastructure, as opposed to the ct->mark approach we proposed initially. As we pass the nf_conntrack_zone object directly around, we don't have to touch all call-sites, but only those, that contain equality checks of zones. Thus, based on the current direction (original or reply), we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID. CT expectations are direction-agnostic entities when expectations are being compared among themselves, so we can only use the identifier in this case. Note that zone identifiers can not be included into the hash mix anymore as they don't contain a "stable" value that would be equal for both directions at all times, f.e. if only zone->id would unconditionally be xor'ed into the table slot hash, then replies won't find the corresponding conntracking entry anymore. If no particular direction is specified when configuring zones, the behaviour is exactly as we expect currently (both directions). Support has been added for the CT netlink interface as well as the x_tables raw CT target, which both already offer existing interfaces to user space for the configuration of zones. Below a minimal, simplified collision example (script in [2]) with netperf sessions: +--- tenant-1 ---+ mark := 1 | netperf |--+ +----------------+ | CT zone := mark [ORIGINAL] [ip,sport] := X +--------------+ +--- gateway ---+ | mark routing |--| SNAT |-- ... + +--------------+ +---------------+ | +--- tenant-2 ---+ | ~~~|~~~ | netperf |--+ +-----------+ | +----------------+ mark := 2 | netserver |------ ... + [ip,sport] := X +-----------+ [ip,port] := Y On the gateway netns, example: iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark conntrack dump from gateway netns: netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024 [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1 tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555 [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1 tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1 src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438 [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1 tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2 src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889 [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2 Taking this further, test script in [2] creates 200 tenants and runs original-tuple colliding netperf sessions each. A conntrack -L dump in the gateway netns also confirms 200 overlapping entries, all in ESTABLISHED state as expected. I also did run various other tests with some permutations of the script, to mention some: SNAT in random/random-fully/persistent mode, no zones (no overlaps), static zones (original, reply, both directions), etc. [1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/ [2] https://paste.fedoraproject.org/242835/65657871/ Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | netfilter: nf_conntrack: push zone object into functionsDaniel Borkmann2015-08-113-10/+7Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch replaces the zone id which is pushed down into functions with the actual zone object. It's a bigger one-time change, but needed for later on extending zones with a direction parameter, and thus decoupling this additional information from all call-sites. No functional changes in this patch. The default zone becomes a global const object, namely nf_ct_zone_dflt and will be returned directly in various cases, one being, when there's f.e. no zoning support. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | netfilter: nf_tables: add nft_dup expressionPablo Neira Ayuso2015-08-074-1/+116
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This new expression uses the nf_dup engine to clone packets to a given gateway. Unlike xt_TEE, we use an index to indicate output interface which should be fine at this stage. Moreover, change to the preemtion-safe this_cpu_read(nf_skb_duplicated) from nf_dup_ipv{4,6} to silence a lockdep splat. Based on the original tee expression from Arturo Borrero Gonzalez, although this patch has diverted quite a bit from this initial effort due to the change to support maps. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>