path: root/src/net/tcp.c
Commit message (author, date, files changed, lines removed/added)
* [tcp] Display "connecting" status until connection is established (Michael Brown, 2019-03-10, 1 file, -0/+21)
    Provide increased visibility into the progress of TCP connections by displaying an explicit "connecting" status message while waiting for the TCP handshake to complete.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [init] Show startup and shutdown function names in debug messages (Michael Brown, 2019-01-25, 1 file, -0/+1)
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Use correct length for memset() (Michael Brown, 2017-03-22, 1 file, -1/+1)
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Send TCP keepalives on idle established connections (Michael Brown, 2016-06-13, 1 file, -0/+38)
    In some circumstances, intermediate devices may lose state in a way that temporarily prevents the successful delivery of packets from a TCP peer. For example, a firewall may drop a NAT forwarding table entry.

    Since iPXE spends most of its time downloading files (and hence purely receiving data, sending only TCP ACKs), this can easily happen in a situation in which there is no reason for iPXE's TCP stack to generate any retransmissions. The temporary loss of connectivity can therefore effectively become permanent.

    Work around this problem by sending TCP keepalives after a period of inactivity on an established connection.

    TCP keepalives usually send a single garbage byte in sequence number space that has already been ACKed by the peer. Since we do not need to elicit a response from the peer, we instead send pure ACKs (with no garbage data) in order to keep the transmit code path simple.

    Originally-implemented-by: Ladi Prosek <lprosek@redhat.com>
    Debugged-by: Ladi Prosek <lprosek@redhat.com>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
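The keepalive behaviour described in this commit can be sketched roughly as follows. This is a simplified model, not the code from tcp.c: the structure `ka_conn`, the function `keepalive_tick()` and the 15-second idle period are all assumptions made for illustration, and "sending a pure ACK" is represented by incrementing a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical idle period before a keepalive is sent (not from the source) */
#define KEEPALIVE_IDLE_MS 15000

/* Minimal stand-in for a TCP connection (hypothetical fields) */
struct ka_conn {
	int established;		/* non-zero once the handshake is complete */
	uint32_t last_activity_ms;	/* time of the last packet sent or received */
	unsigned int keepalives_sent;	/* pure ACKs sent as keepalives */
};

/* Called periodically: send a keepalive if an established connection is idle */
static void keepalive_tick ( struct ka_conn *conn, uint32_t now_ms ) {
	assert ( conn != NULL );

	/* Keepalives apply only to established connections */
	if ( ! conn->established )
		return;

	/* Unsigned subtraction copes with the millisecond counter wrapping */
	if ( ( now_ms - conn->last_activity_ms ) < KEEPALIVE_IDLE_MS )
		return;

	/* Stand-in for transmitting a pure ACK carrying no data */
	conn->keepalives_sent++;
	conn->last_activity_ms = now_ms;
}
```

Because the keepalive is a pure ACK, this sketch needs no special transmit path: a real stack could reuse whatever routine already sends ACKs.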
* [tcp] Guard against malformed TCP options (Michael Brown, 2016-01-28, 1 file, -11/+53)
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Ensure FIN is actually sent if connection is closed while idle (Michael Brown, 2015-07-22, 1 file, -0/+1)
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Gracefully close connections during shutdown (Michael Brown, 2015-07-04, 1 file, -1/+56)
    We currently do not wait for a received FIN before exiting to boot a loaded OS. In the common case of booting from an HTTP server, this means that the TCP connection is left consuming resources on the server side: the server will retransmit the FIN several times before giving up.

    Fix by initiating a graceful close of all TCP connections and waiting (for up to one second) for all connections to finish closing gracefully (i.e. for the outgoing FIN to have been sent and ACKed, and for the incoming FIN to have been received and ACKed at least once).

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
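The bounded wait described above might look something like the following sketch. It is a toy model rather than the shutdown code in tcp.c: the structure `closing_conn` and the function `shutdown_wait()` are hypothetical, and each connection simply counts down the milliseconds it still needs to complete its FIN exchange. Only the one-second limit comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* The commit message specifies a wait of "up to one second" */
#define SHUTDOWN_TIMEOUT_MS 1000

/* Hypothetical model of a closing connection: the number of milliseconds
 * still needed for both FINs to be exchanged and ACKed (zero = closed)
 */
struct closing_conn {
	unsigned int ms_until_closed;
};

/* Wait for all connections to close; returns milliseconds spent waiting */
static unsigned int shutdown_wait ( struct closing_conn *conns,
				    size_t count ) {
	unsigned int elapsed;
	size_t i;

	assert ( ( conns != NULL ) || ( count == 0 ) );

	for ( elapsed = 0 ; elapsed < SHUTDOWN_TIMEOUT_MS ; elapsed++ ) {
		size_t still_open = 0;

		/* Simulate one millisecond of progress on each connection */
		for ( i = 0 ; i < count ; i++ ) {
			if ( conns[i].ms_until_closed ) {
				conns[i].ms_until_closed--;
				still_open++;
			}
		}

		/* Stop as soon as every connection has closed gracefully */
		if ( ! still_open )
			break;
	}

	return elapsed;
}
```

The key design point is that the wait is bounded: a peer that never answers costs at most one second of boot time.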
* [tcp] Do not shrink window when discarding received packets (Michael Brown, 2015-06-25, 1 file, -20/+3)
    We currently shrink the TCP window permanently if we are ever forced (by a low-memory condition) to discard a previously received TCP packet. This behaviour was intended to reduce the number of retransmissions in a lossy network, since lost packets might potentially result in the entire window contents being retransmitted.

    Since commit e0fc8fe ("[tcp] Implement support for TCP Selective Acknowledgements (SACK)") the cost of lost packets has been reduced by around one order of magnitude, and the reduction in the window size (which affects the maximum throughput) is now the more significant cost.

    Remove the code which reduces the TCP maximum window size when a received packet is discarded.

    Reported-by: Wissam Shoukair <wissams@mellanox.com>
    Tested-by: Wissam Shoukair <wissams@mellanox.com>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Implement support for TCP Selective Acknowledgements (SACK) (Michael Brown, 2015-03-12, 1 file, -4/+158)
    The TCP Selective Acknowledgement option (specified in RFC2018) provides a mechanism for the receiver to indicate packets that have been received out of order (e.g. due to earlier dropped packets).

    iPXE often operates in environments in which there is a high probability of packet loss. For example, the legacy USB keyboard emulation in some BIOSes involves polling the USB bus from within a system management interrupt: this introduces an invisible delay of around 500us which is long enough for around 40 full-length packets to be dropped. Similarly, almost all 1Gbps USB2 devices will eventually end up dropping packets because the USB2 bus does not provide enough bandwidth to sustain a 1Gbps stream, and most devices will not provide enough internal buffering to hold a full TCP window's worth of received packets.

    Add support for sending TCP Selective Acknowledgements. This provides the sender with more detailed information about which packets have been lost, and so allows for a more efficient retransmission strategy.

    We include a SACK-permitted option in our SYN packet, since experimentation shows that at least Linux peers will not include a SACK-permitted option in the SYN-ACK packet if one was not present in the initial SYN. (RFC2018 does not seem to mandate this behaviour, but it is consistent with the approach taken in RFC1323.)

    We ignore any received SACK options; this is safe to do since SACK is only ever advisory and we never have to send non-trivial amounts of data.

    Since our TCP receive queue is a candidate for cache discarding under low memory conditions, we may end up discarding data that has been reported as received via a SACK option. This is permitted by RFC2018. We follow the stricture that SACK blocks must not report data which is no longer held by the receiver: previously-reported blocks are validated against the current receive queue before being included within the current SACK block list.

    Experiments in a qemu VM using forced packet drops (by setting NETDEV_DISCARD_RATE to 32) show that implementing SACK improves throughput by around 400%. Experiments with a USB2 NIC (an SMSC7500) show that implementing SACK improves throughput by around 700%, increasing the download rate from 35Mbps up to 250Mbps (which is approximately the usable bandwidth limit for USB2).

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
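One way to satisfy the rule that SACK blocks must only report data still held by the receiver is to derive the blocks directly from the current receive queue. The sketch below does that under simplifying assumptions: the queue is an array sorted by sequence number, sequence-number wraparound is ignored, and the names `ooo_seg`, `sack_block` and `build_sack()` are hypothetical. It is not the algorithm used in tcp.c, which (per the commit message) also tracks previously reported blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* An out-of-order segment held in the receive queue (hypothetical) */
struct ooo_seg {
	uint32_t seq;	/* sequence number of first byte */
	uint32_t len;	/* length in bytes */
};

/* A SACK block: left edge inclusive, right edge exclusive (as in RFC 2018) */
struct sack_block {
	uint32_t left;
	uint32_t right;
};

/* Build SACK blocks from a queue sorted by sequence number.
 * Returns the number of blocks written (at most max_blocks).
 */
static size_t build_sack ( const struct ooo_seg *queue, size_t count,
			   struct sack_block *blocks, size_t max_blocks ) {
	size_t used = 0;
	size_t i;

	assert ( ( queue != NULL ) || ( count == 0 ) );
	assert ( ( blocks != NULL ) || ( max_blocks == 0 ) );

	for ( i = 0 ; i < count ; i++ ) {
		uint32_t left = queue[i].seq;
		uint32_t right = ( queue[i].seq + queue[i].len );

		/* Extend the previous block if this segment touches it */
		if ( used && ( left <= blocks[ used - 1 ].right ) ) {
			if ( right > blocks[ used - 1 ].right )
				blocks[ used - 1 ].right = right;
			continue;
		}

		/* Otherwise start a new block, if there is room */
		if ( used == max_blocks )
			break;
		blocks[used].left = left;
		blocks[used].right = right;
		used++;
	}

	return used;
}
```

Since every block is computed from segments that are in the queue at the time of the call, data discarded under memory pressure can never be reported.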
* [legal] Relicense files under GPL2_OR_LATER_OR_UBDL (Michael Brown, 2015-03-02, 1 file, -1/+1)
    These files cannot be automatically relicensed by util/relicense.pl since they either contain unusual but trivial contributions (such as the addition of __nonnull function attributes), or contain lines dating back to the initial git revision (and so require manual knowledge of the code's origin).

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Defer sending ACKs until all received packets have been processed (Michael Brown, 2014-05-12, 1 file, -8/+25)
    When running inside a virtual machine (or when using the UNDI driver), transmitting packets can be expensive. When we receive several packets in one poll (e.g. because a slow BIOS timer interrupt routine has caused us to fall behind in processing), we can safely send just a single ACK to cover all of the received packets. This reduces the time spent transmitting and allows us to clear the backlog much faster.

    Various RFCs (starting with RFC1122) state that there should be an ACK for at least every second segment. We choose not to enforce this rule. Under normal operation each poll should find at most one received packet, and we will then not delay any ACKs. We delay (i.e. omit) ACKs only when under sufficiently heavy load that we are finding multiple packets per poll; under these conditions it is important to clear the backlog quickly since any delay may lead to dropped packets.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
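The deferral can be pictured as two steps: receiving a segment only marks an ACK as pending, and a single ACK is sent once the poll has drained all received packets. The sketch below illustrates this with hypothetical names (`rx_conn`, `rx_segment()`, `rx_poll_done()`); transmission is again modelled by a counter, and out-of-order handling is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal receive-side state (hypothetical fields) */
struct rx_conn {
	uint32_t rcv_ack;	/* next sequence number expected from peer */
	int ack_pending;	/* an ACK is owed to the peer */
	unsigned int acks_sent;	/* ACK packets actually transmitted */
};

/* Process one received segment without transmitting anything */
static void rx_segment ( struct rx_conn *conn, uint32_t seq, uint32_t len ) {
	assert ( conn != NULL );

	/* Accept in-order data only; this sketch ignores everything else */
	if ( seq == conn->rcv_ack )
		conn->rcv_ack += len;

	/* Either way the peer is owed an ACK, but do not send it yet */
	conn->ack_pending = 1;
}

/* Called once per poll, after all received packets have been processed */
static void rx_poll_done ( struct rx_conn *conn ) {
	assert ( conn != NULL );

	if ( ! conn->ack_pending )
		return;

	/* One cumulative ACK covers every segment seen during this poll */
	conn->acks_sent++;
	conn->ack_pending = 0;
}
```

With one packet per poll this behaves exactly like immediate ACKing; the saving appears only when a poll finds a backlog.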
* [tcp] Profile transmit and receive datapaths (Michael Brown, 2014-04-28, 1 file, -0/+20)
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Update window even if ACK does not acknowledge new data (Michael Brown, 2014-03-07, 1 file, -2/+4)
    iPXE currently ignores ACKs which do not acknowledge any new data. (In particular, it does not stop the retransmission timer; this is done to prevent an immediate retransmission if a duplicate ACK is received while the transmit queue is non-empty.)

    If a peer provides a window size of zero and later sends a duplicate ACK to update the window size, this update will therefore be ignored and iPXE will never be able to transmit data.

    Fix by updating the window size even for ACKs which do not acknowledge new data.

    Reported-by: Wissam Shoukair <wissams@mellanox.com>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
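The essence of the fix is the order of two operations in the ACK handler: record the advertised window first, then decide whether the ACK covers new data. A minimal sketch, using the hypothetical names `tx_state` and `rx_ack()` and ignoring window scaling:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal transmit-side state (hypothetical fields) */
struct tx_state {
	uint32_t snd_una;	/* oldest unacknowledged sequence number */
	uint32_t snd_win;	/* peer's most recently advertised window */
};

/* Handle a received ACK; returns the number of newly acknowledged bytes */
static uint32_t rx_ack ( struct tx_state *tx, uint32_t ack, uint32_t win ) {
	int32_t delta;

	assert ( tx != NULL );

	/* Signed difference copes with sequence number wraparound */
	delta = ( ( int32_t ) ( ack - tx->snd_una ) );

	/* Ignore stale ACKs for data that was already acknowledged */
	if ( delta < 0 )
		return 0;

	/* Update the window even if no new data is acknowledged, so that
	 * a duplicate ACK can reopen a zero window.
	 */
	tx->snd_win = win;

	/* A duplicate ACK changes nothing else */
	if ( delta == 0 )
		return 0;

	tx->snd_una = ack;
	return ( ( uint32_t ) delta );
}
```

Had the window update been placed after the duplicate-ACK check, the zero-window case described in the commit message would never recover.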
* [tcp] Calculate correct MSS from peer address (Michael Brown, 2014-03-04, 1 file, -1/+14)
    iPXE currently advertises a fixed MSS of 1460, which is correct only for IPv4 over Ethernet. For IPv6 over Ethernet, the value should be 1440 (allowing for the larger IPv6 header). For non-Ethernet link layers, the value should reflect the MTU of the underlying network device.

    Use tcpip_mtu() to calculate the transport-layer MTU associated with the peer address, and calculate the MSS to allow for an optionless TCP header as per RFC 6691.

    As a side benefit, we can now fail a connection immediately with a meaningful error message if we have no route to the destination address.

    Reported-by: Anton D. Kachalov <mouse@yandex-team.ru>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
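The figures 1460 and 1440 follow from simple arithmetic on the standard 1500-byte Ethernet MTU and the fixed header sizes. The sketch below reproduces that arithmetic; `transport_mtu()` is a hypothetical stand-in for the role the commit message assigns to tcpip_mtu(), whose real signature is not shown in this log.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed header lengths, excluding any options */
#define IPV4_HDR_LEN 20
#define IPV6_HDR_LEN 40
#define TCP_HDR_LEN 20

/* Transport-layer MTU: link MTU minus the network-layer header.
 * Hypothetical helper, standing in for a per-peer route lookup.
 */
static size_t transport_mtu ( size_t link_mtu, int is_ipv6 ) {
	size_t net_hdr_len = ( is_ipv6 ? IPV6_HDR_LEN : IPV4_HDR_LEN );

	assert ( link_mtu > net_hdr_len );
	return ( link_mtu - net_hdr_len );
}

/* MSS allowing for an optionless TCP header (the RFC 6691 convention) */
static size_t tcp_mss ( size_t link_mtu, int is_ipv6 ) {
	size_t mtu = transport_mtu ( link_mtu, is_ipv6 );

	assert ( mtu > TCP_HDR_LEN );
	return ( mtu - TCP_HDR_LEN );
}
```

For a 1500-byte link MTU, `tcp_mss ( 1500, 0 )` gives 1460 and `tcp_mss ( 1500, 1 )` gives 1440, matching the values quoted in the commit message.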
* [tcp] Add AF_INET6 socket opener (Michael Brown, 2013-10-21, 1 file, -2/+9)
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcpip] Pass through network device to transport layer protocols (Michael Brown, 2013-09-03, 1 file, -0/+2)
    NDP requires knowledge of the network device on which a packet was received.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcpip] Allow binding to unspecified privileged ports (below 1024) (Michael Brown, 2013-08-06, 1 file, -39/+14)
    Originally-implemented-by: Marin Hannache <git@mareo.fr>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Fix comment to match code behaviour (Michael Brown, 2013-07-12, 1 file, -1/+1)
    Reported-by: Thomas Miletich <thomas.miletich@gmail.com>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Do not send RST for unrecognised connections (Michael Brown, 2013-07-12, 1 file, -1/+0)
    On large networks with substantial numbers of monitoring agents, unwanted TCP connection attempts may end up flooding iPXE's ARP cache.

    Fix by silently dropping packets received for unrecognised TCP connections. This should not cause problems, since many firewalls will also silently drop any such packets.

    Reported-by: Jarrod Johnson <jarrod.b.johnson@gmail.com>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Truncate TCP window to prevent future packet discards (Michael Brown, 2012-07-09, 1 file, -3/+20)
    Whenever memory pressure causes a queued packet to be discarded (and so retransmitted), reduce the maximum TCP window to a size that would have prevented the discard.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [arp] Try to avoid discarding ARP cache entries (Michael Brown, 2012-07-09, 1 file, -1/+1)
    Discarding the active ARP cache entry in the middle of a download will substantially disrupt the TCP stream. Try to minimise any such disruption by treating ARP cache entries as expensive, and discarding them only when nothing else is available to discard.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Avoid potential NULL pointer dereference (Michael Brown, 2012-06-30, 1 file, -1/+3)
    Commit ea61075 ("[tcp] Add support for TCP window scaling") introduced a potential NULL pointer dereference by referring to the connection's send window scale before checking whether or not the connection is known.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Use a zero window size for RST packets (Michael Brown, 2012-06-30, 1 file, -1/+1)
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Add support for TCP window scaling (Michael Brown, 2012-06-29, 1 file, -2/+29)
    The maximum unscaled TCP window (64kB) implies a maximum bandwidth of around 300kB/s on a WAN link with an RTT of 200ms. Add support for the TCP window scaling option to remove this upper limit.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
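The 300kB/s figure follows from the fact that at most one window of data can be in flight per round trip, so throughput is bounded by window size divided by RTT. The sketch below works through that calculation and shows how a scale shift raises the bound; the function names are hypothetical, and the shift limit of 14 is the maximum permitted by the window scale option.

```c
#include <assert.h>
#include <stdint.h>

/* Largest shift count permitted for the TCP window scale option */
#define TCP_MAX_WINDOW_SHIFT 14

/* Upper bound on throughput in bytes per second: one window per RTT */
static uint64_t max_throughput ( uint32_t window, uint32_t rtt_ms ) {
	assert ( rtt_ms > 0 );
	return ( ( ( uint64_t ) window * 1000 ) / rtt_ms );
}

/* Effective window for a 16-bit advertised value and a scale shift */
static uint32_t scaled_window ( uint16_t advertised, unsigned int shift ) {
	assert ( shift <= TCP_MAX_WINDOW_SHIFT );
	return ( ( ( uint32_t ) advertised ) << shift );
}
```

With the largest unscaled window of 65535 bytes and a 200ms RTT, `max_throughput ( 65535, 200 )` is 327675 bytes per second, which is the "around 300kB/s" limit quoted above. A shift of 2 quadruples the window and therefore the bound.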
* [tcp] Mark any unacknowledged transmission as a pending operation (Michael Brown, 2012-06-09, 1 file, -3/+33)
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Discard all TCP connections on shutdown (Michael Brown, 2012-05-08, 1 file, -0/+22)
    Allow detection of genuine memory leaks by ensuring that all TCP connections are freed on shutdown.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Fix potential NULL pointer dereference (Michael Brown, 2012-05-08, 1 file, -1/+1)
    Detected using Valgrind.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Allow sufficient headroom for TCP headers (Michael Brown, 2011-09-19, 1 file, -4/+4)
    TCP currently neglects to allow sufficient space for its own headers when allocating I/O buffers. This problem is masked by the fact that the maximum link-layer header size (802.11) is substantially larger than the common Ethernet link-layer header.

    Fix by allowing sufficient space for any TCP headers, as well as the network-layer and link-layer headers.

    Reported-by: Scott K Logan <logans@cottsay.net>
    Debugged-by: Scott K Logan <logans@cottsay.net>
    Tested-by: Scott K Logan <logans@cottsay.net>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Send xfer_window_changed() when window opens (Michael Brown, 2011-06-28, 1 file, -19/+27)
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Update ts_recent whenever window is advanced (Michael Brown, 2011-04-03, 1 file, -9/+22)
    Commit 3f442d3 ("[tcp] Record ts_recent on first received packet") failed to achieve its stated intention. Fix this (and reduce the code size) by moving the ts_recent update to tcp_rx_seq(). This is the code responsible for advancing the window, called by both tcp_rx_syn() and tcp_rx_data(), and so the window check is now redundant.

    Reported-by: Frank Weed <zorbustheknight@gmail.com>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Record ts_recent on first received packet (Michael Brown, 2011-03-26, 1 file, -5/+8)
    Commit 6861304 ("[tcp] Handle out-of-order received packets") introduced a regression in which ts_recent would not be updated until the first packet is received in the ESTABLISHED state, i.e. the timestamp from the SYN+ACK packet would be ignored. This causes the connection to be dropped by strictly-conforming TCP peers, such as FreeBSD.

    Fix by delaying the timestamp window check until after processing the received SYN flag.

    Reported-by: winders@sonnet.com
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Use MAX_LL_NET_HEADER_LEN instead of defining our own MAX_HDR_LEN (Michael Brown, 2010-11-19, 1 file, -4/+5)
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Set PSH flag only on packets containing data (Michael Brown, 2010-11-11, 1 file, -1/+1)
    Suggested-by: Yelena Kadach <klenusik@hotmail.com>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [list] Add list_first_entry() (Michael Brown, 2010-11-08, 1 file, -3/+2)
    There are several points in the iPXE codebase where list_for_each_entry() is (ab)used to extract only the first entry from a list. Add a macro list_first_entry() to make this code easier to read.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [retry] Hold reference while timer is running and during expiry callback (Michael Brown, 2010-09-03, 1 file, -2/+2)
    Guarantee that a retry timer cannot go out of scope while the timer is running, and provide a guarantee to the expiry callback that the timer will remain in scope during the entire callback (similar to the guarantee provided to interface methods).

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Fix a 64bit compile time error (Piotr Jaroszyński, 2010-07-22, 1 file, -1/+1)
    Signed-off-by: Piotr Jaroszyński <p.jaroszynski@gmail.com>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Allow out-of-order receive queue to be discarded (Michael Brown, 2010-07-21, 1 file, -3/+38)
    Allow packets in the receive queue to be discarded in order to free up memory. This avoids a potential deadlock condition in which the missing packet can never be received because the receive queue is occupying all of the memory available for further RX buffers.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Handle out-of-order received packets (Michael Brown, 2010-07-21, 1 file, -34/+150)
    Maintain a queue of received packets, so that lost packets need not result in retransmission of the entire TCP window.

    Increase the TCP window to 8kB, in order that we can potentially transmit enough duplicate ACKs to trigger Fast Retransmission at the sender.

    Using a 10MB HTTP download in qemu-kvm with an artificial drop rate of 1 in 64 packets, this reduces the download time from around 26s to around 4s.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Treat ACKs as sent only when successfully transmitted (Michael Brown, 2010-07-15, 1 file, -21/+20)
    iPXE currently forces sending (i.e. sends a pure ACK even in the absence of fresh data to send) only in response to packets that consume sequence space or that lie outside of the receive window. This ignores the possibility that a previous ACK was not actually sent (due to, for example, the retransmission timer running).

    This does not cause incorrect behaviour, but does cause unnecessary retransmissions from our peer. For example:

      1. Peer sends final data packet (ack 106 seq 521..523)
      2. We send FIN (seq 106..107 ack 523)
      3. Peer sends FIN (ack 106 seq 523..524)
      4. We send nothing since retransmission timer is running for our FIN
      5. Peer ACKs our FIN (ack 107 seq 524..524)
      6. We send nothing since this packet consumes no sequence space
      7. Peer retransmits FIN (ack 107 seq 523..524)
      8. We ACK peer's FIN (seq 107..107 ack 524)

    What should happen at step (6) is that we should ACK the peer's FIN, since we can deduce that we have never sent this ACK.

    Fix by maintaining an "ACK pending" flag that is set whenever we are made aware that our peer needs an ACK (whether by consuming sequence space or by sending a packet that appears out of order), and is cleared only when the ACK packet has been transmitted.

    Reported-by: Piotr Jaroszyński <p.jaroszynski@gmail.com>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
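Steps 3 to 6 of the example above can be replayed against a small model of the "ACK pending" flag. The names `ack_conn`, `try_xmit()` and `rx_packet()` are hypothetical, and the model keeps only the three pieces of state that matter for this scenario; it is a sketch of the idea rather than the logic in tcp.c.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal connection state for the scenario (hypothetical fields) */
struct ack_conn {
	int ack_pending;	/* peer is owed an ACK that has not been sent */
	int timer_running;	/* retransmission timer is running for our data */
	unsigned int acks_sent;	/* ACK packets actually transmitted */
};

/* Attempt to transmit; the flag is cleared only if a packet really goes out */
static void try_xmit ( struct ack_conn *conn ) {
	/* While awaiting an ACK for our own data, nothing is sent */
	if ( conn->timer_running )
		return;

	/* Nothing to do unless the peer is owed an ACK */
	if ( ! conn->ack_pending )
		return;

	conn->acks_sent++;
	conn->ack_pending = 0;
}

/* Handle a received packet */
static void rx_packet ( struct ack_conn *conn, int consumes_seq,
			int acks_our_data ) {
	assert ( conn != NULL );

	/* An ACK for our outstanding data stops the retransmission timer */
	if ( acks_our_data )
		conn->timer_running = 0;

	/* Remember that an ACK is owed, whether or not it can be sent now */
	if ( consumes_seq )
		conn->ack_pending = 1;

	try_xmit ( conn );
}
```

Starting from the state after step 2 (our FIN sent, timer running), the peer's FIN sets the flag but sends nothing; the peer's ACK of our FIN then stops the timer and the still-pending ACK goes out at step 6, so the retransmission in step 7 is avoided.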
* [tcp] Merge boolean flags into a single "flags" field (Michael Brown, 2010-07-15, 1 file, -8/+15)
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Use a dedicated timer for the TIME_WAIT state (Michael Brown, 2010-07-15, 1 file, -9/+32)
    iPXE currently repurposes the retransmission timer to hold the TCP connection in the TIME_WAIT state (i.e. waiting for up to 2*MSL in case we are required to re-ACK our peer's FIN due to a lost ACK). However, the fact that this timer is running will prevent such an ACK from ever being sent, since the logic in tcp_xmit() assumes that a running timer indicates that we ourselves are waiting for an ACK and so blocks the transmission. (We always wait for an ACK before sending our next packet, to keep our transmit data path as simple as possible.)

    Fix by using an entirely separate timer for the TIME_WAIT state, so that packets can still be sent.

    Reported-by: Piotr Jaroszyński <p.jaroszynski@gmail.com>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Randomise local TCP port (Guo-Fu Tseng, 2010-07-13, 1 file, -3/+5)
    Signed-off-by: Guo-Fu Tseng <cooldavid@cooldavid.org>
    Modified-by: Michael Brown <mcb30@ipxe.org>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Fix typos by changing ntohl() to htonl() where appropriate (Michael Brown, 2010-07-13, 1 file, -2/+2)
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Store local port in host byte order (Michael Brown, 2010-07-13, 1 file, -9/+9)
    Every other scalar integer value in struct tcp_connection is in host byte order; change the definition of local_port to match.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Fix potential use-after-free when accessing timestamp option (Michael Brown, 2010-07-07, 1 file, -4/+7)
    Reported-by: Piotr Jaroszyński <p.jaroszynski@gmail.com>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [interface] Convert all data-xfer interfaces to generic interfaces (Michael Brown, 2010-06-22, 1 file, -29/+23)
    Remove data-xfer as an interface type, and replace data-xfer interfaces with generic interfaces supporting the data-xfer methods. Filter interfaces (as used by the TLS layer) are handled using the generic pass-through interface capability.

    A side-effect of this is that deliver_raw() no longer exists as a data-xfer method. (In practice this doesn't lose any efficiency, since there are no instances within the current codebase where xfer_deliver_raw() is used to pass data to an interface supporting the deliver_raw() method.)

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [retry] Use start_timer_fixed() instead of direct timeout manipulation (Michael Brown, 2010-06-22, 1 file, -2/+1)
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [retry] Add timer_init() wrapper function (Michael Brown, 2010-06-22, 1 file, -1/+1)
    Standardise on using timer_init() to initialise an embedded retry timer, to match the coding style used by other embedded objects.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [refcnt] Add ref_init() wrapper function (Michael Brown, 2010-06-22, 1 file, -0/+1)
    Standardise on using ref_init() to initialise an embedded reference count, to match the coding style used by other embedded objects.

    Signed-off-by: Michael Brown <mcb30@ipxe.org>
* [tcp] Update received sequence number before delivering received data (Michael Brown, 2010-05-22, 1 file, -8/+10)
    iPXE currently updates the TCP sequence number after delivering the data to the application via xfer_deliver_iob(). If the application responds to the received data by transmitting more data, this would result in a stale ACK number appearing in the transmitted packet, which potentially causes retransmissions and also gives the undesirable appearance of violating causality (by sending a response to a message that we claim not to have yet received).

    Reported-by: Guo-Fu Tseng <cooldavid@cooldavid.org>
    Signed-off-by: Michael Brown <mcb30@ipxe.org>
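The ordering issue can be shown with a very small model in which the application replies from inside the delivery call. The names `seq_conn`, `app_deliver()` and `rx_data()` are hypothetical; `app_deliver()` stands in for an application reached through xfer_deliver_iob(), and the "transmit" merely records which ACK number would have been sent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal connection state (hypothetical fields) */
struct seq_conn {
	uint32_t rcv_ack;	/* next sequence number expected from peer */
	uint32_t last_tx_ack;	/* ACK number placed in the most recent transmit */
};

/* Hypothetical application handler that replies as soon as data arrives */
static void app_deliver ( struct seq_conn *conn ) {
	/* Any transmit carries whatever ACK number is current right now */
	conn->last_tx_ack = conn->rcv_ack;
}

/* Handle received in-order data of the given length */
static void rx_data ( struct seq_conn *conn, uint32_t len ) {
	assert ( conn != NULL );

	/* Advance the sequence number first, so that a reply sent from
	 * within the delivery call acknowledges the data just received.
	 */
	conn->rcv_ack += len;
	app_deliver ( conn );
}
```

If the two statements in `rx_data()` were swapped, the reply would carry the old ACK number, which is the stale-ACK behaviour the commit message describes.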