path: root/net
Commit message | Author | Age | Files | Lines
* netfilter: nat: remove l4proto->unique_tuple | Florian Westphal | 2018-12-17 | 5 | -123/+56
    Fold the remaining users (icmp, icmpv6, gre) into nf_nat_l4proto_unique_tuple. The static save of the old incarnation of the resolved key in gre and icmp is removed as well; these now use the prandom-based offset like the others.
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: nat: un-export nf_nat_l4proto_unique_tuple | Florian Westphal | 2018-12-17 | 6 | -129/+75
    Almost all l4proto->unique_tuple implementations just call this helper, so make ->unique_tuple() optional and call the helper directly if the l4proto doesn't override it. This is an intermediate step to get rid of ->unique_tuple completely.
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: remove NF_NAT_RANGE_PROTO_RANDOM support | Florian Westphal | 2018-12-17 | 3 | -21/+2
    Historically this was net_random() based, and was then converted to a hash-based algorithm (private boot seed + hash of endpoint addresses) due to concerns about leaking net_random() bits. RANDOM_FULLY mode was added later to avoid problems with the hash-based mode (see commit 34ce324019e76, "netfilter: nf_nat: add full port randomization support", for details). Just make prandom_u32() the default search starting point and get rid of ->secure_port() altogether.
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: remove unused parameters in nf_ct_l4proto_[un]register_sysctl() | Yafang Shao | 2018-12-17 | 1 | -14/+7
    These parameters aren't used now, so remove them.
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Acked-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: nat: limit port clash resolution attempts | Florian Westphal | 2018-12-17 | 1 | -6/+23
    In case almost all or all available ports are taken, clash resolution can take a very long time, resulting in a soft lockup. This can happen when many to-be-natted hosts connect to the same destination:port (e.g. a proxy) and all connections pass through the same SNAT.
    Pick a random offset in the acceptable range, then try an ever smaller number of adjacent port numbers, until either the limit is reached or a usable port is found. This results in at most 248 attempts (128 + 64 + 32 + 16 + 8, i.e. 4 restarts with a new search offset) instead of 64000+.
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
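    Schematically, the bounded search looks like this (illustrative C; pick_port() and in_use() are hypothetical names, not the actual nf_nat code):
        /* Probe a shrinking window of adjacent ports, restarting at a new
         * random offset between windows: 128+64+32+16+8 = 248 probes max. */
        static u16 pick_port(u16 min, u16 range, bool (*in_use)(u16 port))
        {
                unsigned int i, attempts = 128;
                u32 off = prandom_u32();

                for (;;) {
                        for (i = 0; i < attempts; i++, off++) {
                                u16 port = min + (off % range);

                                if (!in_use(port))
                                        return port;
                        }
                        if (attempts <= 8)
                                return 0;       /* give up; caller drops the packet */
                        attempts /= 2;
                        off = prandom_u32();    /* restart at a fresh offset */
                }
        }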
* netfilter: nat: remove unnecessary 'else if' branch | Xiaozhou Liu | 2018-12-17 | 1 | -2/+0
    Since a pseudo-random starting point is now used to find a port in the default case, that 'else if' branch is no longer a necessity. Remove it to simplify the code.
    Signed-off-by: Xiaozhou Liu <liuxiaozhou@bytedance.com>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: ipset: replace a strncpy() with strscpy() | Qian Cai | 2018-12-14 | 1 | -2/+4
    Make overflows as obvious as possible and prevent code from blithely proceeding with a truncated string. This also has the side effect of fixing a compilation warning seen with GCC 8.2.1:
        net/netfilter/ipset/ip_set_core.c: In function 'ip_set_sockfn_get':
        net/netfilter/ipset/ip_set_core.c:2027:3: warning: 'strncpy' writing 32 bytes into a region of size 2 overflows the destination [-Wstringop-overflow=]
    Signed-off-by: Qian Cai <cai@gmx.us>
    Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
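    The shape of the change, roughly (buffer names are illustrative; the key property is that strscpy() always NUL-terminates and reports truncation):
        char dst[IPSET_MAXNAMELEN];     /* illustrative destination */
        ssize_t ret;

        /* Before: truncates silently and may leave dst unterminated. */
        strncpy(dst, src, IPSET_MAXNAMELEN);

        /* After: truncation becomes an explicit, checkable error. */
        ret = strscpy(dst, src, IPSET_MAXNAMELEN);
        if (ret < 0)
                return -E2BIG;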
* netfilter: ipset: fix ip_set_byindex function | Florent Fourcot | 2018-12-14 | 1 | -1/+1
    The new function added by "Introduction of new commands and protocol version 7" is not working, since we return skb2 to the user.
    Signed-off-by: Victorien Molle <victorien.molle@wifirst.fr>
    Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: nat: remove l4 protocol port rovers | Florian Westphal | 2018-12-01 | 5 | -26/+7
    This is a leftover from days when single-CPU systems were common: store the last port used to resolve a clash and use it as the starting point when the next conflict needs to be resolved.
    When we have parallel attempts to connect to the same address:port pair, it's likely that both cores end up computing the same "available" port, as both use the same starting port, and newly used ports won't become visible to other cores until the conntrack gets confirmed later. One of the cores then has to drop the packet at insertion time because the chosen new tuple turns out to be in use after all.
    Let's simplify this: remove the port rover and use a pseudo-random starting point. Note that this doesn't make netfilter default to 'fully random' mode; the 'rover' was only used if NAT could not reuse the source port as-is.
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: Replace call_rcu_bh(), rcu_barrier_bh(), and synchronize_rcu_bh() | Paul E. McKenney | 2018-12-01 | 4 | -8/+8
    Now that call_rcu()'s callback is not invoked until after bh-disable regions of code have completed (in addition to explicitly marked RCU read-side critical sections), call_rcu() can be used in place of call_rcu_bh(). Similarly, rcu_barrier() can be used in place of rcu_barrier_bh(), and synchronize_rcu() in place of synchronize_rcu_bh(). This commit therefore makes these changes.
    Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
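    The substitution is mechanical; for a structure freed via RCU it looks like this (obj and obj_free are placeholders):
        /* Before: bh-flavored API. */
        call_rcu_bh(&obj->rcu, obj_free);
        rcu_barrier_bh();
        synchronize_rcu_bh();

        /* After: the consolidated flavor now also waits for bh-disabled
         * regions, so the plain API is sufficient. */
        call_rcu(&obj->rcu, obj_free);
        rcu_barrier();
        synchronize_rcu();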
* netfilter: nf_flow_table: simplify nf_flow_offload_gc_step() | Taehee Yoo | 2018-11-12 | 1 | -27/+7
    nf_flow_offload_gc_step() and nf_flow_table_iterate() are very similar, so a lot of duplicated code can be removed. After this patch, nf_flow_offload_gc_step() is a simple callback function of nf_flow_table_iterate(), like nf_flow_table_do_cleanup().
    Signed-off-by: Taehee Yoo <ap420073@gmail.com>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: nf_flow_table: make nf_flow_table_iterate() static | Taehee Yoo | 2018-11-12 | 1 | -4/+4
    nf_flow_table_iterate() is a local function; make it static.
    Signed-off-by: Taehee Yoo <ap420073@gmail.com>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: ctnetlink: always honor CTA_MARK_MASK | Andreas Jaggi | 2018-11-12 | 1 | -10/+19
    Useful to set only a particular range of the conntrack mark while leaving existing parts of the value alone, e.g. when updating conntrack marks via netlink from userspace. For NFQUEUE this was already implemented in commit 534473c6080e ("netfilter: ctnetlink: honor CTA_MARK_MASK when setting ctmark"). This adds the same functionality for the other netlink conntrack mark changes.
    Signed-off-by: Andreas Jaggi <andreas.jaggi@waterwave.ch>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
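    Semantically, the masked update keeps all bits outside the mask untouched (a sketch of the intent, not the exact ctnetlink code):
        /* Only bits selected by the netlink-supplied mask change;
         * the remainder of the existing conntrack mark is preserved. */
        u32 new_mark = (ct->mark & ~mask) | (netlink_mark & mask);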
* Merge branch 'master' of git://blackhole.kfki.hu/nf-next | Pablo Neira Ayuso | 2018-11-12 | 4 | -40/+174
    Jozsef Kadlecsik says:
    ====================
    - Introduction of new commands and thus protocol version 7. The new commands make it possible to eliminate the getsockopt interface of ipset and use solely netlink to communicate with the kernel. Due to the strict attribute checking both in user and kernel space, a new protocol number was introduced. Both kernel and userspace are fully backward compatible.
    - Make invalid MAC address checks consistent, from Stefano Brivio. The patch depends on the next one.
    - Allow matching on destination MAC address for mac and ipmac sets, also from Stefano Brivio.
    ====================
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: ipset: Introduction of new commands and protocol version 7 | Jozsef Kadlecsik | 2018-10-27 | 1 | -17/+147
    Two new commands (IPSET_CMD_GET_BYNAME, IPSET_CMD_GET_BYINDEX) are introduced. The new commands make it possible to eliminate the getsockopt operation (in the iptables set/SET match/target) and thus use only netlink communication between userspace and kernel for ipset. With the new protocol version, userspace can know exactly which functionality is supported by the running kernel. Both the kernel and userspace are fully backward compatible.
    Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| * netfilter: ipset: Make invalid MAC address checks consistent | Stefano Brivio | 2018-10-22 | 2 | -7/+7
    Set types bitmap:ipmac and hash:ipmac check that MAC addresses are not all zeroes. Introduce one missing check, and make the remaining ones consistent, using is_zero_ether_addr() instead of comparing against an array containing zeroes. This was already done for hash:mac sets in commit 26c97c5d8dac ("netfilter: ipset: Use is_zero_ether_addr instead of static and memcmp").
    Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
    Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
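    The consistent form of the check, schematically (field names are illustrative):
        /* Before: compare against a static all-zero buffer. */
        static const unsigned char zero_mac[ETH_ALEN];

        if (memcmp(e->ether, zero_mac, ETH_ALEN) == 0)
                return -EINVAL;

        /* After: the dedicated helper from <linux/etherdevice.h>. */
        if (is_zero_ether_addr(e->ether))
                return -EINVAL;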
| * netfilter: ipset: Allow matching on destination MAC address for mac and ipmac sets | Stefano Brivio | 2018-10-22 | 3 | -16/+20
    There doesn't seem to be any reason to restrict MAC address matching to source MAC addresses in set types bitmap:ipmac, hash:ipmac and hash:mac. With this patch, and this setup:
        ip netns add A
        ip link add veth1 type veth peer name veth2 netns A
        ip addr add 192.0.2.1/24 dev veth1
        ip -net A addr add 192.0.2.2/24 dev veth2
        ip link set veth1 up
        ip -net A link set veth2 up
        ip netns exec A ipset create test hash:mac
        dst=$(ip netns exec A cat /sys/class/net/veth2/address)
        ip netns exec A ipset add test ${dst}
        ip netns exec A iptables -P INPUT DROP
        ip netns exec A iptables -I INPUT -m set --match-set test dst -j ACCEPT
    ipset will match packets based on destination MAC address:
        # ping -c1 192.0.2.2 >/dev/null
        # echo $?
        0
    Reported-by: Yi Chen <yiche@redhat.com>
    Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
    Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 2018-11-12 | 6 | -31/+39
| * | act_mirred: clear skb->tstamp on redirect | Eric Dumazet | 2018-11-11 | 2 | -10/+2
    If sch_fq is used at ingress, skbs that might have been timestamped by net_timestamp_set() (if a packet capture is requesting timestamps) could be delayed by an arbitrary amount of time, since the sch_fq time base is MONOTONIC. Fix this problem by moving code from sch_netem.c to act_mirred.c.
    Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tipc: fix link re-establish failure | Jon Maloy | 2018-11-11 | 1 | -4/+7
    When a link failure is detected locally, the link is reset, the flag link->in_session is set to false, and a RESET_MSG with the 'stopping' bit set is sent to the peer. The purpose of this bit is to inform the peer that this endpoint is just going down, and that the peer should handle the reception of this particular RESET message as a local failure. This forces the peer to accept another RESET or ACTIVATE message from this endpoint before it can re-establish the link. This again is necessary to ensure that link session numbers are properly exchanged before the link comes up again.
    If a failure is detected locally at the same time as at the peer endpoint, the peer will do the same, which is also correct behavior. However, when receiving such messages, the endpoints do not distinguish between 'stopping' RESETs and ordinary ones when it comes to updating session numbers: both endpoints copy the received session number and set their 'in_session' flags to true at reception, while they are still expecting another RESET from the peer before they can go ahead and re-establish. This is contradictory, since, after applying the validation check referred to below, the 'in_session' flag will cause rejection of all such messages, and the link will never come up again.
    We now fix this by not only handling received RESET/STOPPING messages as a local failure, but also by omitting to set a new session number and the 'in_session' flag in such cases.
    Fixes: 7ea817f4e832 ("tipc: check session number before accepting link protocol messages")
    Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: sched: cls_flower: validate nested enc_opts_policy to avoid warning | Jakub Kicinski | 2018-11-10 | 1 | -1/+13
    TCA_FLOWER_KEY_ENC_OPTS and TCA_FLOWER_KEY_ENC_OPTS_MASK can currently only contain further nested attributes, which are parsed by hand, so the policy is never actually used, resulting in a W=1 build warning:
        net/sched/cls_flower.c:492:1: warning: 'enc_opts_policy' defined but not used [-Wunused-const-variable=]
         enc_opts_policy[TCA_FLOWER_KEY_ENC_OPTS_MAX + 1] = {
    Add the validation anyway to avoid potential bugs when other attributes are added and to make the attribute structure slightly more clear. Validation will also set extack to point at the bad attribute on error.
    Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
    Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    Acked-by: Simon Horman <simon.horman@netronome.com>
    Acked-by: Jiri Pirko <jiri@mellanox.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
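    One plausible shape of the added check (the exact helper used may differ; the point is that the policy is enforced and extack gets set):
        /* Validate the nested attribute against enc_opts_policy so a
         * malformed attribute is rejected with extack pointing at it. */
        err = nla_parse_nested(tb, TCA_FLOWER_KEY_ENC_OPTS_MAX,
                               nla_enc_opts, enc_opts_policy, extack);
        if (err < 0)
                return err;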
| * | flow_dissector: do not dissect l4 ports for fragments | 배석진 | 2018-11-10 | 1 | -2/+2
    Only the first fragment has the sport/dport information, not the following ones. If we want a consistent hash for all fragments, we need to ignore ports even for the first fragment. This bug is visible for IPv6 traffic if incoming fragments do not have a flow label, since skb_get_hash() will give different results for the first fragment and the following ones. It is also visible if any routing rule wants dissection and sport or dport. See commit 5e5d6fed3741 ("ipv6: route: dissect flow in input path if fib rules need it") for details.
    [edumazet] rewrote the changelog completely.
    Fixes: 06635a35d13d ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
    Signed-off-by: 배석진 <soukjin.bae@samsung.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
| * | inet: frags: better deal with smp races | Eric Dumazet | 2018-11-09 | 1 | -14/+15
    Multiple CPUs might attempt to insert a new fragment in the rhashtable if, for example, RPS is buggy, as reported by 배석진 in https://patchwork.ozlabs.org/patch/994601/
    We use rhashtable_lookup_get_insert_key() instead of rhashtable_insert_fast() to let CPUs losing the race free their own inet_frag_queue and use the one that was inserted by another CPU.
    Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: 배석진 <soukjin.bae@samsung.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
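    The lose-the-race path, schematically (free_candidate() stands in for the actual queue teardown):
        struct inet_frag_queue *prev;

        /* NULL: our queue was inserted.  Non-NULL: another CPU won the
         * race, so drop our candidate and adopt the inserted queue. */
        prev = rhashtable_lookup_get_insert_key(&table, &key, &q->node, params);
        if (IS_ERR(prev))
                goto err;
        if (prev) {
                free_candidate(q);      /* hypothetical teardown helper */
                q = prev;
        }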
* | | net_sched: sch_fq: add dctcp-like marking | Eric Dumazet | 2018-11-11 | 1 | -0/+21
    Similar to 80ba92fa1a92 ("codel: add ce_threshold attribute"). After EDT adoption, it became easier to implement DCTCP-like CE marking. In many cases, queues are not building in the network fabric but on the hosts themselves. If packets leaving fq missed their Earliest Departure Time by XXX usec, we mark them with ECN CE. This gives feedback (after one RTT) to the sender to slow down and find a better operating mode. Example:
        tc qd replace dev eth0 root fq ce_threshold 2.5ms
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: tsq: no longer use limit_output_bytes for paced flows | Eric Dumazet | 2018-11-11 | 2 | -4/+5
    FQ pacing guarantees that paced packets queued by one flow do not add head-of-line blocking for other flows. After TCP GSO conversion, increasing limit_output_bytes to 1 MB is safe, since this maps to at most 16 skbs in qdisc or device queues (or slightly more if some drivers lower {gso_max_segs|size}). We can still queue at most 1 ms worth of traffic (this can be scaled by wifi drivers if they need to).
    Tested:
        # ethtool -c eth0 | egrep "tx-usecs:|tx-frames:" # 40 Gbit mlx4 NIC
        tx-usecs: 16
        tx-frames: 16
        # tc qdisc replace dev eth0 root fq
        # for f in {1..10}; do netperf -P0 -H lpaa24,6 -o THROUGHPUT; done
    Before patch: 27711 26118 27107 27377 27712 27388 27340 27117 27278 27509
    After patch:  37434 36949 36658 36998 37711 37291 37605 36659 36544 37349
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: get rid of tcp_tso_should_defer() dependency on HZ/jiffies | Eric Dumazet | 2018-11-11 | 1 | -2/+5
    tcp_tso_should_defer()'s first heuristic is to not defer if the last send is "old enough". Its current implementation uses jiffies, with their low granularity; TSO autodefer performance should not rely on kernel HZ. After EDT conversion, we have state variables in nanoseconds that let us implement the heuristic properly. This patch increases TSO chunk sizes on medium-rate flows, especially when receivers do not use GRO or similar aggregation. It also reduces bursts for HZ=100 or HZ=250 kernels, making TCP behavior more uniform.
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: refine tcp_tso_should_defer() after EDT adoption | Eric Dumazet | 2018-11-11 | 1 | -3/+4
    tcp_tso_should_defer()'s last step tries to check if the probable next ACK packet is coming in less than half an RTT. The problem is that head->tstamp might be in the future, so we need to use signed arithmetic to avoid overflows.
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
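    The gist, schematically (now_ns and half_srtt_ns are illustrative locals; the point is the signed cast):
        /* head = oldest un-ACKed skb; its ACK should arrive roughly one
         * RTT after it was sent.  head->tstamp may lie in the future, so
         * an unsigned subtraction could wrap to a huge value. */
        s64 age_ns = (s64)(now_ns - head->tstamp);

        if (age_ns < half_srtt_ns)
                goto send_now;  /* next ACK likely more than srtt/2 away */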
* | | tcp: do not try to defer skbs with eor mark (MSG_EOR) | Eric Dumazet | 2018-11-11 | 1 | -0/+4
    Applications using MSG_EOR are giving a strong hint to the TCP stack: subsequent sendmsg() calls can not append more bytes to skbs carrying the EOR mark. Do not try to TSO-defer such skbs; there is really no hope.
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: minor optimization in tcp ack fast path processing | Yafang Shao | 2018-11-11 | 1 | -3/+4
    A bitwise operation is a little faster, so replace after() with a test of the flag FLAG_SND_UNA_ADVANCED, as it is already set at that point. In addition, there's a similar improvement in tcp_cwnd_reduction().
    Cc: Joe Perches <joe@perches.com>
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
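    The substitution, schematically (tcp_ack() computes FLAG_SND_UNA_ADVANCED earlier in the same function):
        /* Before: redo the sequence-number comparison. */
        if (after(ack, prior_snd_una)) {
                /* ... fast path work ... */
        }

        /* After: a single bitwise test on the already-computed flags. */
        if (flag & FLAG_SND_UNA_ADVANCED) {
                /* ... fast path work ... */
        }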
* | | tipc: improve broadcast retransmission algorithm | LUU Duc Canh | 2018-11-11 | 2 | -46/+14
    Currently, the broadcast retransmission algorithm uses the 'prev_retr' field in struct tipc_link to time-stamp the latest broadcast retransmission occasion. This helps restrict retransmission of individual broadcast packets to at most once per 10 milliseconds, even when all other criteria for retransmission are met.
    We now move this time stamp to the control block of each individual packet and remove the other limiting criteria. This simplifies the retransmission algorithm and eliminates any risk of logical errors in selecting which packets can be retransmitted.
    Acked-by: Ying Xue <ying.xue@windriver.com>
    Signed-off-by: LUU Duc Canh <canh.d.luu@dektech.com.au>
    Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: sched: register callbacks for indirect tc block binds | John Hurley | 2018-11-11 | 1 | -1/+255
    Currently, drivers can register to receive TC block bind/unbind callbacks by implementing the setup_tc ndo in any of their given netdevs. However, drivers may also be interested in binds to higher level devices (e.g. tunnel drivers) to potentially offload filters applied to them.
    Introduce indirect block devs, which allow drivers to register callbacks for block binds on other devices. The callback is triggered when the device is bound to a block, allowing the driver to register for rules applied to that block using already available functions.
    Freeing an indirect block callback will trigger an unbind event (if necessary) to direct the driver to remove any offloaded rules and unregister any block rule callbacks. It is the responsibility of the implementing driver to clean up any registered indirect block callbacks before exiting, if the block is still active at such a time.
    Allow registering an indirect block dev callback for a device that is already bound to a block. In this case (if it is an ingress block), register and also trigger the callback, meaning that any already installed rules can be replayed to the calling driver.
    Signed-off-by: John Hurley <john.hurley@netronome.com>
    Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
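    A hedged sketch of the registration flow for, e.g., a tunnel device (callback and variable names are assumptions based on this changelog):
        /* Register interest in TC block binds happening on another netdev
         * (e.g. a tunnel device this driver can offload).  my_indr_setup_cb
         * is a hypothetical driver callback. */
        err = tc_indr_block_cb_register(tunnel_dev, priv,
                                        my_indr_setup_cb, priv);
        if (err)
                return err;

        /* On teardown, this triggers an unbind event if still bound: */
        tc_indr_block_cb_unregister(tunnel_dev, my_indr_setup_cb, priv);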
* | | sctp: Fix SKB list traversal in sctp_intl_store_ordered(). | David S. Miller | 2018-11-11 | 1 | -5/+12
    Same change as made to sctp_intl_store_reasm(). To be fully correct, an iterator has an undefined value when something like skb_queue_walk() naturally terminates. This will actually matter when SKB queues are converted over to list_head. Formalize what this code ends up doing with the current implementation.
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | sctp: Fix SKB list traversal in sctp_intl_store_reasm(). | David S. Miller | 2018-11-11 | 1 | -5/+12
    To be fully correct, an iterator has an undefined value when something like skb_queue_walk() naturally terminates. This will actually matter when SKB queues are converted over to list_head. Formalize what this code ends up doing with the current implementation.
    Signed-off-by: David S. Miller <davem@davemloft.net>
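    The pattern common to this and the previous sctp fix (cmp() is a hypothetical ordering function): insert inside the loop and fall through to a tail append, so the iterator is never used after the walk terminates.
        struct sk_buff *pos;

        skb_queue_walk(queue, pos) {
                if (cmp(pos, new_skb) > 0) {
                        /* first bigger element found: insert before it */
                        __skb_queue_before(queue, pos, new_skb);
                        return;
                }
        }
        /* natural termination: pos is undefined here, don't touch it */
        __skb_queue_tail(queue, new_skb);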
* | | iucv: Remove SKB list assumptions. | David S. Miller | 2018-11-11 | 1 | -26/+15
    Eliminate the assumption that SKBs and SKB list heads can be cast to each other in SKB list handling code. This change also appears to fix a bug, since the list->next pointer was sampled outside of holding the SKB queue lock.
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | OVS: remove VLAN_TAG_PRESENT - fixup | Michał Mirosław | 2018-11-10 | 1 | -1/+1
    It turns out I missed one VLAN_TAG_PRESENT in the OVS code while rebasing. This fixes it.
    Fixes: 9df46aefafa6 ("OVS: remove use of VLAN_TAG_PRESENT")
    Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | udp6: cleanup stats accounting in recvmsg() | Paolo Abeni | 2018-11-10 | 1 | -25/+7
    In the udp6 code path, we needed multiple tests to select the correct MIB to be updated. Since we touch at least one counter at each iteration, it's convenient to use the recently introduced __UDPX_MIB() helper once and remove some code duplication.
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
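    The shape of the cleanup, schematically (assuming the helper returns the per-family MIB; the counter name is illustrative):
        /* One family test picks the MIB; every later counter bump can
         * then use it directly instead of testing the family again. */
        struct udp_mib __percpu *mib = __UDPX_MIB(sk, is_udp4);

        SNMP_INC_STATS(mib, UDP_MIB_INDATAGRAMS);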
* | | net: tcp: remove BUG_ON from tcp_v4_err | Li RongQing | 2018-11-10 | 1 | -1/+0
    If skb is a NULL pointer, the following access of skb's skb_mstamp_ns will trigger a panic, which is the same as BUG_ON.
    Signed-off-by: Li RongQing <lirongqing@baidu.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: sched: red: inform offloads about harddrop setting | Jakub Kicinski | 2018-11-09 | 1 | -0/+1
    To mirror software behaviour on offload more precisely, inform the drivers about the state of the harddrop flag.
    Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    Reviewed-by: John Hurley <john.hurley@netronome.com>
    Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp_bbr: update comments to reflect pacing_margin_percent | Neal Cardwell | 2018-11-09 | 1 | -8/+7
    Recently, in commit ab408b6dc744 ("tcp: switch tcp and sch_fq to new earliest departure time model"), the TCP BBR code switched to a new approach of using an explicit bbr_pacing_margin_percent for shaving a pacing rate "haircut", rather than the previous implicit approach. Update an old comment to reflect the new approach.
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4/tunnel: use __vlan_hwaccel helpers | Michał Mirosław | 2018-11-09 | 1 | -1/+1
    Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | bridge: use __vlan_hwaccel helpers | Michał Mirosław | 2018-11-09 | 3 | -10/+13
    This removes the assumption that vlan_tci != 0 when a tag is present.
    Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | 8021q: use __vlan_hwaccel helpers | Michał Mirosław | 2018-11-09 | 1 | -1/+1
    Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | nfnetlink/queue: use __vlan_hwaccel helpers | Michał Mirosław | 2018-11-09 | 1 | -2/+3
    Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net/core: use __vlan_hwaccel helpers | Michał Mirosław | 2018-11-09 | 3 | -5/+7
    This removes assumptions about the VLAN_TAG_PRESENT bit.
    Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: move __skb_checksum_complete*() to skbuff.c | Cong Wang | 2018-11-09 | 2 | -43/+43
    __skb_checksum_complete_head() and __skb_checksum_complete() are both declared in skbuff.h; they fit better in skbuff.c than datagram.c.
    Cc: Stefano Brivio <sbrivio@redhat.com>
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: 8021q: vlan_core: allow use list of vlans for real device | Ivan Khoronzhuk | 2018-11-09 | 1 | -0/+27
    It's redundant for drivers to hold a list of VLANs when exactly the same list exists in the VLAN core. In most cases it's needed only to traverse the VLAN devices and their VIDs and sync some settings with h/w, so add an API to simplify this; a usage sketch follows below. At least some of these drivers can also benefit (grep "for_each.*vid" -r drivers/net/ethernet/):
        drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
        drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c
        drivers/net/ethernet/qlogic/qlge/qlge_main.c
        drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
        drivers/net/ethernet/via/via-rhine.c
        drivers/net/ethernet/via/via-velocity.c
        drivers/net/ethernet/intel/igb/igb_main.c
        drivers/net/ethernet/intel/ice/ice_main.c
        drivers/net/ethernet/intel/e1000/e1000_main.c
        drivers/net/ethernet/intel/i40e/i40e_main.c
        drivers/net/ethernet/intel/e1000e/netdev.c
        drivers/net/ethernet/intel/igbvf/netdev.c
        drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
        drivers/net/ethernet/intel/ixgb/ixgb_main.c
        drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
        drivers/net/ethernet/amd/xgbe/xgbe-dev.c
        drivers/net/ethernet/emulex/benet/be_main.c
        drivers/net/ethernet/neterion/vxge/vxge-main.c
        drivers/net/ethernet/adaptec/starfire.c
        drivers/net/ethernet/brocade/bna/bnad.c
    Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
    Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
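    A hedged usage sketch of the new helper (my_* names are hypothetical driver code; the callback signature follows the API added here):
        /* Re-program all VIDs of a real device into hardware, e.g. after
         * a reset, without the driver keeping its own VLAN list. */
        static int my_restore_vid(struct net_device *vdev, int vid, void *arg)
        {
                struct my_priv *priv = arg;             /* hypothetical state */

                return my_hw_add_vid(priv, vid);        /* hypothetical h/w op */
        }

        /* in the driver's reset path: */
        err = vlan_for_each(real_dev, my_restore_vid, priv);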
* | | net: core: dev_addr_lists: add auxiliary func to handle reference address updates | Ivan Khoronzhuk | 2018-11-09 | 1 | -0/+97
    In order to avoid a full table update, and only remove or add a new address, an auxiliary function exists, named __hw_addr_sync_dev(). It allows the end driver to do nothing when nothing changed, and to add/remove when a concrete address is first added or last removed. But it doesn't cover the cases when an address of the real device or a vlan is reused by other vlans or vlan/macvlan devices.
    For handling the events when an address is reused/unreused, this patch adds a new auxiliary routine, __hw_addr_ref_sync_dev(). It allows the driver to do nothing when nothing has changed, and to do updates only for an address being added/reused/deleted/unreused. Thus, clone address changes for vlans can be mirrored in the table. The function is exclusive with __hw_addr_sync_dev(). It is the responsibility of the end driver to identify the address's vlan device, if it needs to.
    Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
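    A sketch of the intended use, with the sync callback seeing a reference count so the driver can react only to first-add and last-remove (the signatures are inferred from the changelog and should be treated as assumptions):
        /* Hypothetical driver callback: program the filter only when an
         * address first appears (its reference count reaches 1). */
        static int my_sync(struct net_device *dev,
                           const unsigned char *addr, int ref)
        {
                return ref == 1 ? my_hw_add_addr(dev, addr) : 0; /* hypothetical */
        }

        /* in the driver's address-sync path: */
        err = __hw_addr_ref_sync_dev(&dev->uc, dev, my_sync, my_unsync);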
* | | OVS: remove use of VLAN_TAG_PRESENT | Michał Mirosław | 2018-11-09 | 4 | -18/+23
    This is a minimal change to allow removal of VLAN_TAG_PRESENT. It leaves OVS unable to use the CFI bit, as fixing that would need deeper surgery involving the userspace interface.
    Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* | | sock: Reset dst when changing sk_mark via setsockopt | David Barmann | 2018-11-09 | 1 | -2/+4
    When setting the SO_MARK socket option, if the mark changes, the dst needs to be reset so that a new route lookup is performed. This fixes the case where an application wants to change routing by setting a new sk_mark: if this is done after some packets have already been sent, the dst is cached and the new mark has no effect.
    Signed-off-by: David Barmann <david.barmann@stackpath.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
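    The essence of the fix, as a sketch of the sock_setsockopt() SO_MARK branch (sk_dst_reset() drops the cached route so the next send does a fresh lookup):
        case SO_MARK:
                if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
                        ret = -EPERM;
                } else if (val != sk->sk_mark) {
                        sk->sk_mark = val;
                        sk_dst_reset(sk);       /* force a new route lookup */
                }
                break;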
* | | openvswitch: remove BUG_ON from get_dpdev | Li RongQing | 2018-11-09 | 1 | -1/+0
    If local is a NULL pointer, the following access of local's dev will trigger a panic, which is the same as BUG_ON.
    Signed-off-by: Li RongQing <lirongqing@baidu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>