summaryrefslogtreecommitdiffstats
path: root/net/sched
Commit message (Collapse)AuthorAgeFilesLines
...
* | rtnetlink: make rtnl_register accept a flags parameterFlorian Westphal2017-08-103-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | This change allows us to later indicate to rtnetlink core that certain doit functions should be called without acquiring rtnl_mutex. This change should have no effect, we simply replace the last (now unused) calcit argument with the new flag. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-08-101-10/+10
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | The UDP offload conflict is dealt with by simply taking what is in net-next where we have removed all of the UFO handling code entirely. The TCP conflict was a case of local variables in a function being removed from both net and net-next. In netvsc we had an assignment right next to where a missing set of u64 stats sync object inits were added. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: set xt_tgchk_param par.net properly in ipt_init_targetXin Long2017-08-091-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now xt_tgchk_param par in ipt_init_target is a local varibale, par.net is not initialized there. Later when xt_check_target calls target's checkentry in which it may access par.net, it would cause kernel panic. Jaroslav found this panic when running: # ip link add TestIface type dummy # tc qd add dev TestIface ingress handle ffff: # tc filter add dev TestIface parent ffff: u32 match u32 0 0 \ action xt -j CONNMARK --set-mark 4 This patch is to pass net param into ipt_init_target and set par.net with it properly in there. v1->v2: As Wang Cong pointed, I missed ipt_net_id != xt_net_id, so fix it by also passing net_id to __tcf_ipt_init. v2->v3: Missed the fixes tag, so add it. Fixes: ecb2421b5ddf ("netfilter: add and use nf_ct_netns_get/put") Reported-by: Jaroslav Aster <jaster@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: get rid of some forward declarationsWANG Cong2017-08-091-111/+103Star
| | | | | | | | | | | | | | | | | | | | | | | | | | If we move up tcf_fill_node() we can get rid of these forward declarations. Also, move down tfilter_notify_chain() to group them together. Reported-by: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: use void pointer for filter handleWANG Cong2017-08-0712-146/+136Star
| | | | | | | | | | | | | | | | | | | | | | | | Now we use 'unsigned long fh' as a pointer in every place, it is safe to convert it to a void pointer now. This gets rid of many casts to pointer. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: refactor notification code for RTM_DELTFILTERWANG Cong2017-08-071-5/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is confusing to use 'unsigned long fh' as both a handle and a pointer, especially commit 9ee7837449b3 ("net sched filters: fix notification of filter delete with proper handle"). This patch introduces tfilter_del_notify() so that we can pass it as a pointer as before, and we don't need to check RTM_DELTFILTER in tcf_fill_node() any more. This prepares for the next patch. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: get rid of struct tc_to_netdevJiri Pirko2017-08-075-112/+86Star
| | | | | | | | | | | | | | | | | | | | | | Get rid of struct tc_to_netdev which is now just unnecessary container and rather pass per-type structures down to drivers directly. Along with that, consolidate the naming of per-type structure variables in cls_*. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: move prio into cls_commonJiri Pirko2017-08-071-3/+0Star
| | | | | | | | | | | | | | | | | | | | prio is not cls_flower specific, but it is meaningful for all classifiers. Seems that only mlxsw cares about the value. Obviously, cls offload in other drivers is broken. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: push cls related args into cls_common structureJiri Pirko2017-08-075-32/+22Star
| | | | | | | | | | | | | | | | | | | | As ndo_setup_tc is generic offload op for whole tc subsystem, does not really make sense to have cls-specific args. So move them under cls_common structurure which is embedded in all cls structs. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: make egress_dev flag part of flower offload structJiri Pirko2017-08-071-1/+1
| | | | | | | | | | | | | | | | | | Since this is specific to flower now, make it part of the flower offload struct. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: rename TC_SETUP_MATCHALL to TC_SETUP_CLSMATCHALLJiri Pirko2017-08-071-2/+2
| | | | | | | | | | | | | | | | | | In order to be aligned with the rest of the types, rename TC_SETUP_MATCHALL to TC_SETUP_CLSMATCHALL. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: make type an argument for ndo_setup_tcJiri Pirko2017-08-075-35/+26Star
| | | | | | | | | | | | | | | | | | Since the type is always present, push it to be a separate argument to ndo_setup_tc. On the way, name the type enum and use it for arg type. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: avoid atomic swap in tcf_exts_changeJiri Pirko2017-08-043-13/+7Star
| | | | | | | | | | | | | | | | tcf_exts_change is always called on newly created exts, which are not used on fastpath. Therefore, simple struct copy is enough. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: cls_u32: no need to call tcf_exts_change for newly allocated structJiri Pirko2017-08-041-14/+4Star
| | | | | | | | | | | | | | | | | | As the n struct was allocated right before u32_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: cls_route: no need to call tcf_exts_change for newly allocated ↵Jiri Pirko2017-08-041-21/+9Star
| | | | | | | | | | | | | | | | | | | | | | struct As the f struct was allocated right before route4_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: cls_flow: no need to call tcf_exts_change for newly allocated structJiri Pirko2017-08-041-25/+16Star
| | | | | | | | | | | | | | | | | | As the fnew struct just was allocated, so no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: cls_cgroup: no need to call tcf_exts_change for newly allocated ↵Jiri Pirko2017-08-041-12/+2Star
| | | | | | | | | | | | | | | | | | | | | | struct As the new struct just was allocated, so no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: cls_bpf: no need to call tcf_exts_change for newly allocated structJiri Pirko2017-08-041-19/+6Star
| | | | | | | | | | | | | | | | | | As the prog struct was allocated right before cls_bpf_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: cls_basic: no need to call tcf_exts_change for newly allocated ↵Jiri Pirko2017-08-041-11/+2Star
| | | | | | | | | | | | | | | | | | | | | | struct As the f struct was allocated right before basic_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: cls_matchall: no need to call tcf_exts_change for newly ↵Jiri Pirko2017-08-041-12/+2Star
| | | | | | | | | | | | | | | | | | | | | | allocated struct As the head struct was allocated right before mall_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: cls_fw: no need to call tcf_exts_change for newly allocated structJiri Pirko2017-08-041-16/+5Star
| | | | | | | | | | | | | | | | | | As the f struct was allocated right before fw_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: cls_flower: no need to call tcf_exts_change for newly allocated ↵Jiri Pirko2017-08-041-11/+2Star
| | | | | | | | | | | | | | | | | | | | | | struct As the f struct was allocated right before fl_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: cls_fw: rename fw_change_attrs functionJiri Pirko2017-08-041-6/+5Star
| | | | | | | | | | | | | | | | Since the function name is misleading since it is not changing anything, name it similarly to other cls. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: cls_bpf: rename cls_bpf_modify_existing functionJiri Pirko2017-08-041-6/+4Star
| | | | | | | | | | | | | | | | | | The name cls_bpf_modify_existing is highly misleading, as it indeed does not modify anything existing. It does not modify at all. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: use tcf_exts_has_actions instead of exts->nr_actionsJiri Pirko2017-08-041-1/+1
| | | | | | | | | | | | | | | | For check in tcf_exts_dump use tcf_exts_has_actions helper instead of exts->nr_actions for checking if there are any actions present. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: remove check for number of actions in tcf_exts_execJiri Pirko2017-08-041-1/+2
| | | | | | | | | | | | | | | | Leave it to tcf_action_exec to return TC_ACT_OK in case there is no action present. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: remove redundant helpers tcf_exts_is_predicative and ↵Jiri Pirko2017-08-043-3/+3
| | | | | | | | | | | | | | | | | | | | tcf_exts_is_available These two helpers are doing the same as tcf_exts_has_actions, so remove them and use tcf_exts_has_actions instead. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: change names of action number helpers to be aligned with the restJiri Pirko2017-08-041-1/+1
| | | | | | | | | | | | | | | | | | | | The rest of the helpers are named tcf_exts_*, so change the name of the action number helpers to be aligned. While at it, change to inline functions. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: remove unneeded tcf_em_tree_changeJiri Pirko2017-08-043-13/+7Star
| | | | | | | | | | | | | | | | Since tcf_em_tree_validate could be always called on a newly created filter, there is no need for this change function. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: sch_atm: use Qdisc_class_common structureJiri Pirko2017-08-041-6/+6
| | | | | | | | | | | | | | | | | | Even if it is only for classid now, use this common struct a be aligned with the rest of the classful qdiscs. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net sched actions: add time filter for action dumpingJamal Hadi Salim2017-07-311-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds support for filtering based on time since last used. When we are dumping a large number of actions it is useful to have the option of filtering based on when the action was last used to reduce the amount of data crossing to user space. With this patch the user space app sets the TCA_ROOT_TIME_DELTA attribute with the value in milliseconds with "time of interest since now". The kernel converts this to jiffies and does the filtering comparison matching entries that have seen activity since then and returns them to user space. Old kernels and old tc continue to work in legacy mode since they dont specify this attribute. Some example (we have 400 actions bound to 400 filters); at installation time. Using updated when tc setting the time of interest to 120 seconds earlier (we see 400 actions): prompt$ hackedtc actions ls action gact since 120000| grep index | wc -l 400 go get some coffee and wait for > 120 seconds and try again: prompt$ hackedtc actions ls action gact since 120000 | grep index | wc -l 0 Lets see a filter bound to one of these actions: .... filter pref 10 u32 filter pref 10 u32 fh 800: ht divisor 1 filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10 (rule hit 2 success 1) match 7f000002/ffffffff at 12 (success 1 ) action order 1: gact action pass random type none pass val 0 index 23 ref 2 bind 1 installed 1145 sec used 802 sec Action statistics: Sent 84 bytes 1 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 .... that coffee took long, no? It was good. Now lets ping -c 1 127.0.0.2, then run the actions again: prompt$ hackedtc actions ls action gact since 120 | grep index | wc -l 1 More details please: prompt$ hackedtc -s actions ls action gact since 120000 action order 0: gact action pass random type none pass val 0 index 23 ref 2 bind 1 installed 1270 sec used 30 sec Action statistics: Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 And the filter? filter pref 10 u32 filter pref 10 u32 fh 800: ht divisor 1 filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10 (rule hit 4 success 2) match 7f000002/ffffffff at 12 (success 2 ) action order 1: gact action pass random type none pass val 0 index 23 ref 2 bind 1 installed 1324 sec used 84 sec Action statistics: Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batchJamal Hadi Salim2017-07-311-10/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When you dump hundreds of thousands of actions, getting only 32 per dump batch even when the socket buffer and memory allocations allow is inefficient. With this change, the user will get as many as possibly fitting within the given constraints available to the kernel. The top level action TLV space is extended. An attribute TCA_ROOT_FLAGS is used to carry flags; flag TCA_FLAG_LARGE_DUMP_ON is set by the user indicating the user is capable of processing these large dumps. Older user space which doesnt set this flag doesnt get the large (than 32) batches. The kernel uses the TCA_ROOT_COUNT attribute to tell the user how many actions are put in a single batch. As such user space app knows how long to iterate (independent of the type of action being dumped) instead of hardcoded maximum of 32 thus maintaining backward compat. Some results dumping 1.5M actions below: first an unpatched tc which doesnt understand these features... prompt$ time -p tc actions ls action gact | grep index | wc -l 1500000 real 1388.43 user 2.07 sys 1386.79 Now lets see a patched tc which sets the correct flags when requesting a dump: prompt$ time -p updatedtc actions ls action gact | grep index | wc -l 1500000 real 178.13 user 2.02 sys 176.96 That is about 8x performance improvement for tc app which sets its receive buffer to about 32K. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net sched actions: Use proper root attribute table for actionsJamal Hadi Salim2017-07-311-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Bug fix for an issue which has been around for about a decade. We got away with it because the enumeration was larger than needed. Fixes: 7ba699c604ab ("[NET_SCHED]: Convert actions from rtnetlink to new netlink API") Suggested-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-07-212-3/+3
|\|
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2017-07-211-2/+2
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) BPF verifier signed/unsigned value tracking fix, from Daniel Borkmann, Edward Cree, and Josef Bacik. 2) Fix memory allocation length when setting up calls to ->ndo_set_mac_address, from Cong Wang. 3) Add a new cxgb4 device ID, from Ganesh Goudar. 4) Fix FIB refcount handling, we have to set it's initial value before the configure callback (which can bump it). From David Ahern. 5) Fix double-free in qcom/emac driver, from Timur Tabi. 6) A bunch of gcc-7 string format overflow warning fixes from Arnd Bergmann. 7) Fix link level headroom tests in ip_do_fragment(), from Vasily Averin. 8) Fix chunk walking in SCTP when iterating over error and parameter headers. From Alexander Potapenko. 9) TCP BBR congestion control fixes from Neal Cardwell. 10) Fix SKB fragment handling in bcmgenet driver, from Doug Berger. 11) BPF_CGROUP_RUN_PROG_SOCK_OPS needs to check for null __sk, from Cong Wang. 12) xmit_recursion in ppp driver needs to be per-device not per-cpu, from Gao Feng. 13) Cannot release skb->dst in UDP if IP options processing needs it. From Paolo Abeni. 14) Some netdev ioctl ifr_name[] NULL termination fixes. From Alexander Levin and myself. 15) Revert some rtnetlink notification changes that are causing regressions, from David Ahern. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits) net: bonding: Fix transmit load balancing in balance-alb mode rds: Make sure updates to cp_send_gen can be observed net: ethernet: ti: cpsw: Push the request_irq function to the end of probe ipv4: initialize fib_trie prior to register_netdev_notifier call. rtnetlink: allocate more memory for dev_set_mac_address() net: dsa: b53: Add missing ARL entries for BCM53125 bpf: more tests for mixed signed and unsigned bounds checks bpf: add test for mixed signed and unsigned bounds checks bpf: fix up test cases with mixed signed/unsigned bounds bpf: allow to specify log level and reduce it for test_verifier bpf: fix mixed signed/unsigned derived min/max value bounds ipv6: avoid overflow of offset in ip6_find_1stfragopt net: tehuti: don't process data if it has not been copied from userspace Revert "rtnetlink: Do not generate notifications for CHANGEADDR event" net: dsa: mv88e6xxx: Enable CMODE config support for 6390X dt-binding: ptp: Add SoC compatibility strings for dte ptp clock NET: dwmac: Make dwmac reset unconditional net: Zero terminate ifr_name in dev_ifname(). wireless: wext: terminate ifr name coming from userspace netfilter: fix netfilter_net_init() return ...
| | * net sched actions: rename act_get_notify() to tcf_get_notify()Roman Mashak2017-07-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Make name consistent with other TC event notification routines, such as tcf_add_notify() and tcf_del_notify() Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful ↵Michal Hocko2017-07-131-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* / net: Remove all references to SKB_GSO_UDP.David S. Miller2017-07-171-6/+0Star
|/ | | | | | Such packets are no longer possible. Signed-off-by: David S. Miller <davem@davemloft.net>
* net, sched: convert Qdisc.refcnt from atomic_t to refcount_tReshetova, Elena2017-07-042-8/+8
| | | | | | | | | | | | | | refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: convert sock.sk_refcnt from atomic_t to refcount_tReshetova, Elena2017-07-011-1/+1
| | | | | | | | | | | | | | | | | | | refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. This patch uses refcount_inc_not_zero() instead of atomic_inc_not_zero_hint() due to absense of a _hint() version of refcount API. If the hint() version must be used, we might need to revisit API. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: convert sock.sk_wmem_alloc from atomic_t to refcount_tReshetova, Elena2017-07-011-1/+1
| | | | | | | | | | | | | | refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-06-301-1/+2
|\ | | | | | | | | | | | | A set of overlapping changes in macvlan and the rocker driver, nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: Fix one possible panic when no destroy callbackGao Feng2017-06-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When qdisc fail to init, qdisc_create would invoke the destroy callback to cleanup. But there is no check if the callback exists really. So it would cause the panic if there is no real destroy callback like the qdisc codel, fq, and so on. Take codel as an example following: When a malicious user constructs one invalid netlink msg, it would cause codel_init->codel_change->nla_parse_nested failed. Then kernel would invoke the destroy callback directly but qdisc codel doesn't define one. It causes one panic as a result. Now add one the check for destroy to avoid the possible panic. Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation") Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | bpf: expose prog id for cls_bpf and act_bpfDaniel Borkmann2017-06-212-0/+6
| | | | | | | | | | | | | | | | | | | | | | In order to be able to retrieve the attached programs from cls_bpf and act_bpf, we need to expose the prog ids via netlink so that an application can later on get an fd based on the id through the BPF_PROG_GET_FD_BY_ID command, and dump related prog info via BPF_OBJ_GET_INFO_BY_FD command for bpf(2). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: act_tunnel_key: make UDP checksum configurableJiri Benc2017-06-151-3/+12
| | | | | | | | | | | | | | | | | | Allow requesting of zero UDP checksum for encapsulated packets. The name and meaning of the attribute is "NO_CSUM" in order to have the same meaning of the attribute missing and being 0. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: act_tunnel_key: request UDP checksum by defaultJiri Benc2017-06-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | There's currently no way to request (outer) UDP checksum with act_tunnel_key. This is problem especially for IPv6. Right now, tunnel_key action with IPv6 does not work without going through hassles: both sides have to have udp6zerocsumrx configured on the tunnel interface. This is obviously not a good solution universally. It makes more sense to compute the UDP checksum by default even for IPv4. Just set the default to request the checksum when using act_tunnel_key. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-06-152-6/+6
|\| | | | | | | | | | | | | The conflicts were two cases of overlapping changes in batman-adv and the qed driver. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/act_pedit: fix an error codeDan Carpenter2017-06-141-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | I'm reviewing static checker warnings where we do ERR_PTR(0), which is the same as NULL. I'm pretty sure we intended to return ERR_PTR(-EINVAL) here. Sometimes these bugs lead to a NULL dereference but I don't immediately see that problem here. Fixes: 71d0ed7079df ("net/act_pedit: Support using offset relative to the conventional network headers") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net_sched: move tcf_lock down after gen_replace_estimator()WANG Cong2017-06-141-5/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Laura reported a sleep-in-atomic kernel warning inside tcf_act_police_init() which calls gen_replace_estimator() with spinlock protection. It is not necessary in this case, we already have RTNL lock here so it is enough to protect concurrent writers. For the reader, i.e. tcf_act_police(), it needs to make decision based on this rate estimator, in the worst case we drop more/less packets than necessary while changing the rate in parallel, it is still acceptable. Reported-by: Laura Abbott <labbott@redhat.com> Reported-by: Nick Huber <nicholashuber@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: propagate tc filter chain index down the ndo_setup_tc callJiri Pirko2017-06-085-14/+23
| | | | | | | | | | | | | | | | | | | | | | We need to push the chain index down to the drivers, so they have the information to which chain the rule belongs. For now, no driver supports multichain offload, so only chain 0 is supported. This is needed to prevent chain squashes during offload for now. Later this will be used to implement multichain offload. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>