summaryrefslogtreecommitdiffstats
path: root/net/ipv4/tcp.c
Commit message (Collapse)AuthorAgeFilesLines
* mm: add a low limit to alloc_large_system_hashTim Bird2012-05-241-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | UDP stack needs a minimum hash size value for proper operation and also uses alloc_large_system_hash() for proper NUMA distribution of its hash tables and automatic sizing depending on available system memory. On some low memory situations, udp_table_init() must ignore the alloc_large_system_hash() result and reallocs a bigger memory area. As we cannot easily free old hash table, we leak it and kmemleak can issue a warning. This patch adds a low limit parameter to alloc_large_system_hash() to solve this problem. We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table allocation. Reported-by: Mark Asselstine <mark.asselstine@windriver.com> Reported-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2012-05-211-2/+1Star
|\
| * tcp: do_tcp_sendpages() must try to push data out on oom conditionsWilly Tarreau2012-05-181-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since recent changes on TCP splicing (starting with commits 2f533844 "tcp: allow splice() to build full TSO packets" and 35f9c09f "tcp: tcp_sendpages() should call tcp_push() once"), I started seeing massive stalls when forwarding traffic between two sockets using splice() when pipe buffers were larger than socket buffers. Latest changes (net: netdev_alloc_skb() use build_skb()) made the problem even more apparent. The reason seems to be that if do_tcp_sendpages() fails on out of memory condition without being able to send at least one byte, tcp_push() is not called and the buffers cannot be flushed. After applying the attached patch, I cannot reproduce the stalls at all and the data rate it perfectly stable and steady under any condition which previously caused the problem to be permanent. The issue seems to have been there since before the kernel migrated to git, which makes me think that the stalls I occasionally experienced with tux during stress-tests years ago were probably related to the same issue. This issue was first encountered on 3.0.31 and 3.2.17, so please backport to -stable. Signed-off-by: Willy Tarreau <w@1wt.eu> Acked-by: Eric Dumazet <edumazet@google.com> Cc: <stable@vger.kernel.org>
* | net/ipv4: replace simple_strtoul with kstrtoulEldad Zack2012-05-201-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | Replace simple_strtoul with kstrtoul in three similar occurrences, all setup handlers: * route.c: set_rhash_entries * tcp.c: set_thash_entries * udp.c: set_uhash_entries Also check if the conversion failed. Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: bool conversionsEric Dumazet2012-05-171-10/+10
| | | | | | | | | | | | | | | | | | | | | | bool conversions where possible. __inline__ -> inline space cleanups Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: include/net/sock.h cleanupEric Dumazet2012-05-171-6/+6
| | | | | | | | | | | | | | | | | | | | | | bool/const conversions where possible __inline__ -> inline space cleanups Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Convert net_ratelimit uses to net_<level>_ratelimitedJoe Perches2012-05-151-7/+7
| | | | | | | | | | | | | | | | | | | | Standardize the net core ratelimited logging functions. Coalesce formats, align arguments. Change a printk then vprintk sequence to use printf extension %pV. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Move rcvq sending to tcp_input.cPavel Emelyanov2012-05-111-33/+0Star
| | | | | | | | | | | | | | | | | | It actually works on the input queue and will use its read mem routines, thus it's better to have in in the tcp_input.c file. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2012-05-081-4/+5
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/intel/e1000e/param.c drivers/net/wireless/iwlwifi/iwl-agn-rx.c drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c drivers/net/wireless/iwlwifi/iwl-trans.h Resolved the iwlwifi conflict with mainline using 3-way diff posted by John Linville and Stephen Rothwell. In 'net' we added a bug fix to make iwlwifi report a more accurate skb->truesize but this conflicted with RX path changes that happened meanwhile in net-next. In e1000e a conflict arose in the validation code for settings of adapter->itr. 'net-next' had more sophisticated logic so that logic was used. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: change tcp_adv_win_scale and tcp_rmem[2]Eric Dumazet2012-05-031-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_adv_win_scale default value is 2, meaning we expect a good citizen skb to have skb->len / skb->truesize ratio of 75% (3/4) In 2.6 kernels we (mis)accounted for typical MSS=1460 frame : 1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392. So these skbs were considered as not bloated. With recent truesize fixes, a typical MSS=1460 frame truesize is now the more precise : 2048 + 256 = 2304. But 2304 * 3/4 = 1728. So these skb are not good citizen anymore, because 1460 < 1728 (GRO can escape this problem because it build skbs with a too low truesize.) This also means tcp advertises a too optimistic window for a given allocated rcvspace : When receiving frames, sk_rmem_alloc can hit sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often, especially when application is slow to drain its receive queue or in case of losses (netperf is fast, scp is slow). This is a major latency source. We should adjust the len/truesize ratio to 50% instead of 75% This patch : 1) changes tcp_adv_win_scale default to 1 instead of 2 2) increase tcp_rmem[2] limit from 4MB to 6MB to take into account better truesize tracking and to allow autotuning tcp receive window to reach same value than before. Note that same amount of kernel memory is consumed compared to 2.6 kernels. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: implement tcp coalescing in tcp_queue_rcv()Eric Dumazet2012-05-031-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extend tcp coalescing implementing it from tcp_queue_rcv(), the main receiver function when application is not blocked in recvmsg(). Function tcp_queue_rcv() is moved a bit to allow its call from tcp_data_queue() This gives good results especially if GRO could not kick, and if skb head is a fragment. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: early retransmitYuchung Cheng2012-05-031-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements RFC 5827 early retransmit (ER) for TCP. It reduces DUPACK threshold (dupthresh) if outstanding packets are less than 4 to recover losses by fast recovery instead of timeout. While the algorithm is simple, small but frequent network reordering makes this feature dangerous: the connection repeatedly enter false recovery and degrade performance. Therefore we implement a mitigation suggested in the appendix of the RFC that delays entering fast recovery by a small interval, i.e., RTT/4. Currently ER is conservative and is disabled for the rest of the connection after the first reordering event. A large scale web server experiment on the performance impact of ER is summarized in section 6 of the paper "Proportional Rate Reduction for TCP”, IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf Note that Linux has a similar feature called THIN_DUPACK. The differences are THIN_DUPACK do not mitigate reorderings and is only used after slow start. Currently ER is disabled if THIN_DUPACK is enabled. I would be happy to merge THIN_DUPACK feature with ER if people think it's a good idea. ER is enabled by sysctl_tcp_early_retrans: 0: Disables ER 1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4. 2: (Default) reduce dupthresh like mode 1. In addition, delay entering fast recovery by RTT/4. Note: mode 2 is implemented in the third part of this patch series. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp repair: Fix unaligned access when repairing options (v2)Pavel Emelyanov2012-04-261-39/+21Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't pick __u8/__u16 values directly from raw pointers, but instead use an array of structures of code:value pairs. This is OK, since the buffer we take options from is not an skb memory, but a user-to-kernel one. For those options which don't require any value now, require this to be zero (for potential future extension of this API). v2: Changed tcp_repair_opt to use two __u32-s as spotted by David Laight. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: skb_can_coalesce returns a booleanEric Dumazet2012-04-241-1/+2
| | | | | | | | | | Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock()Neal Cardwell2012-04-211-0/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit moves the (substantial) common code shared between tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family independent function, tcp_init_sock(). Centralizing this functionality should help avoid drift issues, e.g. where the IPv4 side is updated without a corresponding update to IPv6. There was already some drift: IPv4 initialized snd_cwnd to TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to 2 (in this case it should not matter, since snd_cwnd is also initialized in tcp_init_metrics(), but the general risks and maintenance overhead remain). When diffing the old and new code, note that new tcp_init_sock() function uses the order of steps from the tcp_v4_init_sock() implementation (the order is slightly different in tcp_v6_init_sock()). Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Repair connection-time negotiated parametersPavel Emelyanov2012-04-211-0/+71
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are options, which are set up on a socket while performing TCP handshake. Need to resurrect them on a socket while repairing. A new sockoption accepts a buffer and parses it. The buffer should be CODE:VALUE sequence of bytes, where CODE is standard option code and VALUE is the respective value. Only 4 options should be handled on repaired socket. To read 3 out of 4 of these options the TCP_INFO sockoption can be used. An ability to get the last one (the mss_clamp) was added by the previous patch. Now the restore. Three of these options -- timestamp_ok, mss_clamp and snd_wscale -- are just restored on a coket. The sack_ok flags has 2 issues. First, whether or not to do sacks at all. This flag is just read and set back. No other sack info is saved or restored, since according to the standart and the code dropping all sack-ed segments is OK, the sender will resubmit them again, so after the repair we will probably experience a pause in connection. Next, the fack bit. It's just set back on a socket if the respective sysctl is set. No collected stats about packets flow is preserved. As far as I see (plz, correct me if I'm wrong) the fack-based congestion algorithm survives dropping all of the stats and repairs itself eventually, probably losing the performance for that period. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Report mss_clamp with TCP_MAXSEG option in repair modePavel Emelyanov2012-04-211-0/+2
| | | | | | | | | | | | | | | | | | The mss_clamp is the only connection-time negotiated option which cannot be obtained from the user space. Make the TCP_MAXSEG sockopt report one in the repair mode. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Repair socket queuesPavel Emelyanov2012-04-211-3/+86
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reading queues under repair mode is done with recvmsg call. The queue-under-repair set by TCP_REPAIR_QUEUE option is used to determine which queue should be read. Thus both send and receive queue can be read with this. Caller must pass the MSG_PEEK flag. Writing to queues is done with sendmsg call and yet again -- the repair-queue option can be used to push data into the receive queue. When putting an skb into receive queue a zero tcp header is appented to its head to address the tcp_hdr(skb)->syn and the ->fin checks by the (after repair) tcp_recvmsg. These flags flags are both set to zero and that's why. The fin cannot be met in the queue while reading the source socket, since the repair only works for closed/established sockets and queueing fin packet always changes its state. The syn in the queue denotes that the respective skb's seq is "off-by-one" as compared to the actual payload lenght. Thus, at the rcv queue refill we can just drop this flag and set the skb's sequences to precice values. When the repair mode is turned off, the write queue seqs are updated so that the whole queue is considered to be 'already sent, waiting for ACKs' (write_seq = snd_nxt <= snd_una). From the protocol POV the send queue looks like it was sent, but the data between the write_seq and snd_nxt is lost in the network. This helps to avoid another sockoption for setting the snd_nxt sequence. Leaving the whole queue in a 'not yet sent' state (as it will be after sendmsg-s) will not allow to receive any acks from the peer since the ack_seq will be after the snd_nxt. Thus even the ack for the window probe will be dropped and the connection will be 'locked' with the zero peer window. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Initial repair modePavel Emelyanov2012-04-211-1/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This includes (according the the previous description): * TCP_REPAIR sockoption This one just puts the socket in/out of the repair mode. Allowed for CAP_NET_ADMIN and for closed/establised sockets only. When repair mode is turned off and the socket happens to be in the established state the window probe is sent to the peer to 'unlock' the connection. * TCP_REPAIR_QUEUE sockoption This one sets the queue which we're about to repair. The 'no-queue' is set by default. * TCP_QUEUE_SEQ socoption Sets the write_seq/rcv_nxt of a selected repaired queue. Allowed for TCP_CLOSE-d sockets only. When the socket changes its state the other seq-s are changed by the kernel according to the protocol rules (most of the existing code is actually reused). * Ability to forcibly bind a socket to a port The sk->sk_reuse is set to SK_FORCE_REUSE. * Immediate connect modification The connect syscall initializes the connection, then directly jumps to the code which finalizes it. * Silent close modification The close just aborts the connection (similar to SO_LINGER with 0 time) but without sending any FIN/RST-s to peer. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Move code aroundPavel Emelyanov2012-04-211-1/+1
| | | | | | | | | | | | | | | | This is just the preparation patch, which makes the needed for TCP repair code ready for use. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: cleanup unsigned to unsigned intEric Dumazet2012-04-151-4/+4
|/ | | | | | | Use of "unsigned int" is preferred to bare "unsigned" in net tree. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2012-04-121-6/+5Star
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Fix bluetooth userland regression reported by Keith Packard, from Gustavo Padovan. 2) Revert ath9k PS idle change, from Sujith Manoharan. 3) Correct default TCP memory limits (again), from Eric Dumazet. 4) Fix tcp_rcv_rtt_update() accidental use of unscaled RTT, from Neal Cardwell. 5) We made a facility for layers like wireless to say how much tailroom they need in the SKB for link layer stuff such as wireless encryption etc., but TCP works hard to fill every SKB out to the end defeating this specification. This leads to every TCP packet getting reallocated by the wireless code in order to have the right amount of tailroom available. Fix TCP to only fill SKBs out to the real amount of data area it asked for during the allocation, this way it won't eat into the slack added for the device's tailroom needs. Reported by Marc Merlin and fixed by Eric Dumazet. 6) Leaks, endian bugs, and new device IDs in bluetooth from Santosh Nayak, João Paulo Rechi Vita, Cho, Yu-Chen, Andrei Emeltchenko, AceLan Kao, and Andrei Emeltchenko. 7) OOPS on tty_close fix in bluetooth's hci_ldisc from Johan Hovold. 8) netfilter erroneously scales TCP window twice, fix from Changli Gao. 9) Memleak fix in wext-core from Julia Lawall. 10) Consistently handle invalid TCP packets in ipv4 vs. ipv6 conntrack, from Jozsef Kadlecsik. 11) Validate IP header length properly in netfilter conntrack's ipv4_get_l4proto(). * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (39 commits) NFC: Fix the LLCP Tx fragmentation loop rtlwifi: Add missing DMA buffer unmapping for PCI drivers rtlwifi: Preallocate USB read buffers and eliminate kalloc in read routine tcp: avoid order-1 allocations on wifi and tx path net: allow pskb_expand_head() to get maximum tailroom bridge: Do not send queries on multicast group leaves MAINTAINERS: Mark NATSEMI driver as orphan'd. tcp: fix tcp_rcv_rtt_update() use of an unscaled RTT sample tcp: restore correct limit Revert "ath9k: fix going to full-sleep on PS idle" rt2x00: Fix rfkill_polling register function. bcma: fix build error on MIPS; implicit pcibios_enable_device netfilter: nf_conntrack: fix incorrect logic in nf_conntrack_init_net netfilter: nf_ct_ipv4: packets with wrong ihl are invalid netfilter: nf_ct_ipv4: handle invalid IPv4 and IPv6 packets consistently net/wireless/wext-core.c: add missing kfree rtlwifi: Fix oops on rate-control failure mac80211: Convert WARN_ON to WARN_ON_ONCE rtlwifi: rtl8192de: Fix firmware initialization nl80211: ensure interface is up in various APIs ...
| * tcp: avoid order-1 allocations on wifi and tx pathEric Dumazet2012-04-111-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Marc Merlin reported many order-1 allocations failures in TX path on its wireless setup, that dont make any sense with MTU=1500 network, and non SG capable hardware. After investigation, it turns out TCP uses sk_stream_alloc_skb() and used as a convention skb_tailroom(skb) to know how many bytes of data payload could be put in this skb (for non SG capable devices) Note : these skb used kmalloc-4096 (MTU=1500 + MAX_HEADER + sizeof(struct skb_shared_info) being above 2048) Later, mac80211 layer need to add some bytes at the tail of skb (IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and since no more tailroom is available has to call pskb_expand_head() and request order-1 allocations. This patch changes sk_stream_alloc_skb() so that only sk->sk_prot->max_header bytes of headroom are reserved, and use a new skb field, avail_size to hold the data payload limit. This way, order-0 allocations done by TCP stack can leave more than 2 KB of tailroom and no more allocation is performed in mac80211 layer (or any layer needing some tailroom) avail_size is unioned with mark/dropcount, since mark will be set later in IP stack for output packets. Therefore, skb size is unchanged. Reported-by: Marc MERLIN <marc@merlins.org> Tested-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: restore correct limitEric Dumazet2012-04-101-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c43b874d5d714f (tcp: properly initialize tcp memory limits) tried to fix a regression added in commits 4acb4190 & 3dc43e3, but still get it wrong. Result is machines with low amount of memory have too small tcp_rmem[2] value and slow tcp receives : Per socket limit being 1/1024 of memory instead of 1/128 in old kernels, so rcv window is capped to small values. Fix this to match comment and previous behavior. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge tag 'dmaengine-fixes' of ↵Linus Torvalds2012-04-111-2/+2
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine Pull dmaengine fixes from Dan Williams: 1/ regression fix for Xen as it now trips over a broken assumption about the dma address size on 32-bit builds 2/ new quirk for netdma to ignore dma channels that cannot meet netdma alignment requirements 3/ fixes for two long standing issues in ioatdma (ring size overflow) and iop-adma (potential stack corruption) * tag 'dmaengine-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine: netdma: adding alignment check for NETDMA ops ioatdma: DMA copy alignment needed to address IOAT DMA silicon errata ioat: ring size variables need to be 32bit to avoid overflow iop-adma: Corrected array overflow in RAID6 Xscale(R) test. ioat: fix size of 'completion' for Xen
| * netdma: adding alignment check for NETDMA opsDave Jiang2012-04-061-2/+2
| | | | | | | | | | | | | | | | | | | | This is the fallout from adding memcpy alignment workaround for certain IOATDMA hardware. NetDMA will only use DMA engine that can handle byte align ops. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | tcp: tcp_sendpages() should call tcp_push() onceEric Dumazet2012-04-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2f533844242 (tcp: allow splice() to build full TSO packets) added a regression for splice() calls using SPLICE_F_MORE. We need to call tcp_flush() at the end of the last page processed in tcp_sendpages(), or else transmits can be deferred and future sends stall. Add a new internal flag, MSG_SENDPAGE_NOTLAST, acting like MSG_MORE, but with different semantic. For all sendpage() providers, its a transparent change. Only sock_sendpage() and tcp_sendpages() can differentiate the two different flags provided by pipe_to_sendpage() Reported-by: Tom Herbert <therbert@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Eric Dumazet <eric.dumazet@gmail>com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: allow splice() to build full TSO packetsEric Dumazet2012-04-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096) The call to tcp_push() at the end of do_tcp_sendpages() forces an immediate xmit when pipe is not already filled, and tso_fragment() try to split these skb to MSS multiples. 4096 bytes are usually split in a skb with 2 MSS, and a remaining sub-mss skb (assuming MTU=1500) This makes slow start suboptimal because many small frames are sent to qdisc/driver layers instead of big ones (constrained by cwnd and packets in flight of course) In fact, applications using sendmsg() (adding an additional memory copy) instead of vmsplice()/splice()/sendfile() are a bit faster because of this anomaly, especially if serving small files in environments with large initial [c]wnd. Call tcp_push() only if MSG_MORE is not set in the flags parameter. This bit is automatically provided by splice() internals but for the last page, or on all pages if user specified SPLICE_F_MORE splice() flag. In some workloads, this can reduce number of sent logical packets by an order of magnitude, making zero-copy TCP actually faster than one-copy :) Reported-by: Tom Herbert <therbert@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Eric Dumazet <eric.dumazet@gmail>com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4: Standardize prefixes for message loggingJoe Perches2012-03-131-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add #define pr_fmt(fmt) as appropriate. Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files. Standardize on "UDPLite: " for appropriate uses. Some prefixes were previously "UDPLITE: " and "UDP-Lite: ". Add KBUILD_MODNAME ": " to icmp and gre. Remove embedded prefixes as appropriate. Add missing "\n" to pr_info in gre.c. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Convert printks to pr_<level>Joe Perches2012-03-121-4/+3Star
|/ | | | | | | | | | | | | | | | | | | | | | | | | | Use a more current kernel messaging style. Convert a printk block to print_hex_dump. Coalesce formats, align arguments. Use %s, __func__ instead of embedding function names. Some messages that were prefixed with <foo>_close are now prefixed with <foo>_fini. Some ah4 and esp messages are now not prefixed with "ip ". The intent of this patch is to later add something like #define pr_fmt(fmt) "IPv4: " fmt. to standardize the output messages. Text size is trivially reduced. (x86-32 allyesconfig) $ size net/ipv4/built-in.o* text data bss dec hex filename 887888 31558 249696 1169142 11d6f6 net/ipv4/built-in.o.new 887934 31558 249800 1169292 11d78c net/ipv4/built-in.o.old Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vfs: fix panic in __d_lookup() with high dentry hashtable countsDimitri Sivanich2012-02-141-2/+3
| | | | | | | | | | | | | | When the number of dentry cache hash table entries gets too high (2147483648 entries), as happens by default on a 16TB system, use of a signed integer in the dcache_init() initialization loop prevents the dentry_hashtable from getting initialized, causing a panic in __d_lookup(). Fix this in dcache_init() and similar areas. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* tcp: properly initialize tcp memory limitsJason Wang2012-02-021-2/+2
| | | | | | | | | | | | | Commit 4acb4190 tries to fix the using uninitialized value introduced by commit 3dc43e3, but it would make the per-socket memory limits too small. This patch fixes this and also remove the redundant codes introduced in 4acb4190. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Disambiguate kernel messageArun Sharma2012-02-011-4/+15
| | | | | | | | | | | | | | | | | | | | | | | | | Some of our machines were reporting: TCP: too many of orphaned sockets even when the number of orphaned sockets was well below the limit. We print a different message depending on whether we're out of TCP memory or there are too many orphaned sockets. Also move the check out of line and cleanup the messages that were printed. Signed-off-by: Arun Sharma <asharma@fb.com> Suggested-by: Mohan Srinivasan <mohan@fb.com> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: David Miller <davem@davemloft.net> Cc: Glauber Costa <glommer@parallels.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/tcp: Fix tcp memory limits initialization when !CONFIG_SYSCTLGlauber Costa2012-01-301-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | sysctl_tcp_mem() initialization was moved to sysctl_tcp_ipv4.c in commit 3dc43e3e4d0b52197d3205214fe8f162f9e0c334, since it became a per-ns value. That code, however, will never run when CONFIG_SYSCTL is disabled, leading to bogus values on those fields - causing hung TCP sockets. This patch fixes it by keeping an initialization code in tcp_init(). It will be overwritten by the first net namespace init if CONFIG_SYSCTL is compiled in, and do the right thing if it is compiled out. It is also named properly as tcp_init_mem(), to properly signal its non-sysctl side effect on TCP limits. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: David S. Miller <davem@davemloft.net> Link: http://lkml.kernel.org/r/4F22D05A.8030604@parallels.com [ renamed the function, tidied up the changelog a bit ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
* per-netns ipv4 sysctl_tcp_memGlauber Costa2011-12-131-9/+2Star
| | | | | | | | | | | | | | This patch allows each namespace to independently set up its levels for tcp memory pressure thresholds. This patch alone does not buy much: we need to make this values per group of process somehow. This is achieved in the patches that follows in this patchset. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> CC: David S. Miller <davem@davemloft.net> CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: remove TCP_OFF and TCP_PAGE macrosEric Dumazet2011-12-061-13/+10Star
| | | | | | | | As mentioned by Joe Perches, TCP_OFF() and TCP_PAGE() macros are useless. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: tcp_sendmsg() page recyclingEric Dumazet2011-12-041-1/+6
| | | | | | | | | | | | | | | If our TCP_PAGE(sk) is not shared (page_count() == 1), we can set page offset to 0. This permits better filling of the pages on small to medium tcp writes. "tbench 16" results on my dev server (2x4x2 machine) : Before : 3072 MB/s After : 3146 MB/s (2.4 % gain) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: avoid frag allocation for small framesEric Dumazet2011-11-291-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_sendmsg() uses select_size() helper to choose skb head size when a new skb must be allocated. If GSO is enabled for the socket, current strategy is to force all payload data to be outside of headroom, in PAGE fragments. This strategy is not welcome for small packets, wasting memory. Experiments show that best results are obtained when using 2048 bytes for skb head (This includes the skb overhead and various headers) This patch provides better len/truesize ratios for packets sent to loopback device, and reduce memory needs for in-flight loopback packets, particularly on arches with big pages. If a sender sends many 1-byte packets to an unresponsive application, receiver rmem_alloc will grow faster and will stop queuing these packets sooner, or will collapse its receive queue to free excess memory. netperf -t TCP_RR results are improved by ~4 %, and many workloads are improved as well (tbench, mysql...) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: tcp_sendmsg() wrong access to sk_route_capsEric Dumazet2011-11-291-4/+4
| | | | | | | | | | Now sk_route_caps is u64, its dangerous to use an integer to store result of an AND operator. It wont work if NETIF_F_SG is moved on the upper part of u64. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: introduce and use netdev_features_t for device features setsMichał Mirosław2011-11-161-1/+2
| | | | | | | | | | v2: add couple missing conversions in drivers split unexporting netdev_fix_features() implemented %pNF convert sock::sk_route_(no?)caps Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
* TCP: remove TCP_DEBUGFlavio Leitner2011-10-241-2/+0Star
| | | | | | | | It was enabled by default and the messages guarded by the define are useful. Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: md5: dont write skb head in tcp_md5_hash_header()Eric Dumazet2011-10-241-6/+8
| | | | | | | | | | | tcp_md5_hash_header() writes into skb header a temporary zero value, this might confuse other users of this area. Since tcphdr is small (20 bytes), copy it in a temporary variable and make the change in the copy. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add const qualifiers where possibleEric Dumazet2011-10-211-9/+9
| | | | | | | | | | | | Adding const qualifiers to pointers can ease code review, and spot some bugs. It might allow compiler to optimize code further. For example, is it legal to temporary write a null cksum into tcphdr in tcp_md5_hash_header() ? I am afraid a sniffer could catch the temporary null value... Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add skb frag size accessorsEric Dumazet2011-10-191-5/+4Star
| | | | | | | | | | | To ease skb->truesize sanitization, its better to be able to localize all references to skb frags size. Define accessors : skb_frag_size() to fetch frag size, and skb_frag_size_{set|add|sub}() to manipulate it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: report ECN_SEEN in tcp_infoEric Dumazet2011-10-031-1/+3
| | | | | | | | | | | | | | | | | | | | Allows ss command (iproute2) to display "ecnseen" if at least one packet with ECT(0) or ECT(1) or ECN was received by this socket. "ecn" means ECN was negotiated at session establishment (TCP level) "ecnseen" means we received at least one packet with ECT fields set (IP level) ss -i ... ESTAB 0 0 192.168.20.110:22 192.168.20.144:38016 ino:5950 sk:f178e400 mem:(r0,w0,f0,t0) ts sack ecn ecnseen bic wscale:7,8 rto:210 rtt:12.5/7.5 cwnd:10 send 9.3Mbps rcv_space:14480 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: rename tcp_skb_cb flagsEric Dumazet2011-09-271-4/+4
| | | | | | | | | | Rename struct tcp_skb_cb "flags" to "tcp_flags" to ease code review and maintenance. Its content is a combination of FIN/SYN/RST/PSH/ACK/URG/ECE/CWR flags Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: md5: remove one indirection level in tcp_md5sig_poolEric Dumazet2011-09-171-24/+17Star
| | | | | | | | | | | | tcp_md5sig_pool is currently an 'array' (a percpu object) of pointers to struct tcp_md5sig_pool. Only the pointers are NUMA aware, but objects themselves are all allocated on a single node. Remove this extra indirection to get proper percpu memory (NUMA aware) and make code simpler. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv4: convert to SKB frag APIsIan Campbell2011-08-251-1/+2
| | | | | | | | | | | | Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
* net: refine {udp|tcp|sctp}_mem limitsEric Dumazet2011-07-071-8/+2Star
| | | | | | | | | | | | | | | | | Current tcp/udp/sctp global memory limits are not taking into account hugepages allocations, and allow 50% of ram to be used by buffers of a single protocol [ not counting space used by sockets / inodes ...] Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram per protocol, and a minimum of 128 pages. Heavy duty machines sysadmins probably need to tweak limits anyway. References: https://bugzilla.stlinux.com/show_bug.cgi?id=38032 Reported-by: starlight <starlight@binnacle.cx> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Allow no-cache copy from user on transmitTom Herbert2011-04-051-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch uses __copy_from_user_nocache on transmit to bypass data cache for a performance improvement. skb_add_data_nocache and skb_copy_to_page_nocache can be called by sendmsg functions to use this feature, initial support is in tcp_sendmsg. This functionality is configurable per device using ethtool. Presumably, this feature would only be useful when the driver does not touch the data. The feature is turned on by default if a device indicates that it does some form of checksum offload; it is off by default for devices that do no checksum offload or indicate no checksum is necessary. For the former case copy-checksum is probably done anyway, in the latter case the device is likely loopback in which case the no cache copy is probably not beneficial. This patch was tested using 200 instances of netperf TCP_RR with 1400 byte request and one byte reply. Platform is 16 core AMD x86. No-cache copy disabled: 672703 tps, 97.13% utilization 50/90/99% latency:244.31 484.205 1028.41 No-cache copy enabled: 702113 tps, 96.16% utilization, 50/90/99% latency 238.56 467.56 956.955 Using 14000 byte request and response sizes demonstrate the effects more dramatically: No-cache copy disabled: 79571 tps, 34.34 %utlization 50/90/95% latency 1584.46 2319.59 5001.76 No-cache copy enabled: 83856 tps, 34.81% utilization 50/90/95% latency 2508.42 2622.62 2735.88 Note especially the effect on latency tail (95th percentile). This seems to provide a nice performance improvement and is consistent in the tests I ran. Presumably, this would provide the greatest benfits in the presence of an application workload stressing the cache and a lot of transmit data happening. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>