Cross Reference: /freebsd-current/sys/netinet/tcp

History log of /freebsd-current/sys/netinet/tcp_var.h
Revision	Date	Author	Comments
# ea916b64	18-May-2024	Randall Stewart <rrs@FreeBSD.org>	Remove TCP_SAD optional code now that the sack filter performs this function. With the commit of D44903 we no longer need the SAD option. Instead all stacks that use the sack filter inherit its protection against sack-attack. Reviewed by: tuexen@ Differential Revision:https://reviews.freebsd.org/D45216
# 2a9aae9e	08-May-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: add counter to track when SACK loss recovery uses TSO Add a counter to track how frequently SACK has transmitted more than one MSS using TSO. Instances when this will be beneficial is the use of PRR, or when ACK thinning due to GRO/LRO or ACK discards by the network are present. Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D45070
# dcdfe449	08-May-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: add sysctl to allow/disallow TSO during SACK loss recovery Introduce net.inet.tcp.sack.tso for future use when TSO is ready to be used during loss recovery. Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D45068
# fce03f85	05-May-2024	Randall Stewart <rrs@FreeBSD.org>	TCP can be subject to Sack Attacks lets fix this issue. There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like: ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704) This first sack looks fine but then the attacker sends ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705) ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707) ... These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries. To combat this we introduce three things. when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing. Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks. The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway. The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks). After this set of changes is in rack can drop its SAD detection completely Reviewed by:tuexen@, rscheff@ Differential Revision: <https://reviews.freebsd.org/D44903>
# 1d14e88e	08-Apr-2024	Mark Johnston <markj@FreeBSD.org>	tcp: Make tcp_var.h more self-contained struct tcpcb embeds a struct osd and a struct callout. Rather than forcing all consumers to pull in the same headers, include the headers directly. No functional change intended. Reviewed by: glebius MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D44685
# 60d8dbbe	18-Jan-2024	Kristof Provost <kp@FreeBSD.org>	netinet: add a probe point for IP, IP6, ICMP, ICMP6, UDP and TCP stats counters When debugging network issues one common clue is an unexpectedly incrementing error counter. This is helpful, in that it gives us an idea of what might be going wrong, but often these counters may be incremented in different functions. Add a static probe point for them so that we can use dtrace to get futher information (e.g. a stack trace). For example: dtrace -n 'mib:ip:count: { printf("%d", arg0); stack(); }' This can be disabled by setting the following kernel option: options KDTRACE_NO_MIB_SDT Reviewed by: gallatin, tuexen (previous version), gnn (previous version) Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D43504
# 5a268d86	03-Apr-2024	Michael Tuexen <tuexen@FreeBSD.org>	tcp: fix comment Make the comment consistent with the code. Reviewed by: rscheff MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D44611
# dd7b86e2	18-Mar-2024	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove IS_FASTOPEN() macro The macro is more obfuscating than helping as it just checks a single flag of t_flags. All other t_flags bits are checked without a macro. A bigger problem was that declaration of the macro in tcp_var.h depended on a kernel option. It is a bad practice to create such definitions in installable headers. Reviewed by: rscheff, tuexen, kib Differential Revision: https://reviews.freebsd.org/D44362
# 220ee18f	13-Mar-2024	Konstantin Belousov <kib@FreeBSD.org>	netinet/tcp_var.h: always define IS_FASTOPEN() for kernel compilation env and drop the definition for userspace (which matched TCP_RFC7413) since it depends on presence of the kernel option. Reviewed by: glebius, rscheff Sponsored by: NVIDIA networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D44349
# e4315bbc	13-Mar-2024	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: move struct tcp_ifcap declaration under _KERNEL Reviewed by: rscheff, tuexen, kib Differential Revision: https://reviews.freebsd.org/D44340
# e18b97bd	12-Mar-2024	Randall Stewart <rrs@FreeBSD.org>	Update to bring the rack stack with all its fixes in. This brings the rack stack up to the current level used at NF. Many fixes and improvements have been added. I also add in a fix to BBR to deal with the changes that have been in hpts for a while i.e. only one call no matter if mbuf queue or tcp_output. It basically does little except BBlogs and is a placemark for future work on doing path capacity measurements. With a bit of a struggle with git I finally got rack_pcm.c into place (apologies for not noticing this error). The LINT kernel is running on my box now .. sigh. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision:https://reviews.freebsd.org/D43986
# c112243f	11-Mar-2024	Brooks Davis <brooks@FreeBSD.org>	Revert "Update to bring the rack stack with all its fixes in." This commit was incomplete and breaks LINT kernels. The tree has been broken for 8+ hours. This reverts commit f6d489f402c320f1a6eaa473491a0b8c3878113e.
# f6d489f4	11-Mar-2024	Randall Stewart <rrs@FreeBSD.org>	Update to bring the rack stack with all its fixes in. This brings the rack stack up to the current level used at NF. Many fixes and improvements have been added. I also add in a fix to BBR to deal with the changes that have been in hpts for a while i.e. only one call no matter if mbuf queue or tcp_output. Note there is a new file that I can't figure out how to get in rack_pcm.c It basically does little except BBlogs and is a placemark for future work on doing path capacity measurements. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision:https://reviews.freebsd.org/D43986
# c7c325d0	24-Jan-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: pass maxseg around instead of calculating locally Improve slowpath processing (reordering, retransmissions) slightly by calculating maxseg only once. This typically saves one of two calls to tcp_maxseg(). Reviewed By: glebius, tuexen, cc, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43536
# 7f3184ba	22-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove outdated comment This paragraph should have been removed in 446ccdd08e2a.
# 30409ecd	06-Jan-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: do not purge SACK scoreboard on first RTO Keeping the SACK scoreboard intact after the first RTO and retransmitting all data anew only on subsequent RTOs allows a more timely and efficient loss recovery under many adverse cirumstances. Reviewed By: tuexen, #transport MFC after: 10 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D42906
# a8b70cf2	24-Dec-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	netpfil: Use accessor functions and named constants for all tcphdr flags Update all remaining references to the struct tcphdr th_x2 field. This completes the compatibilty of various aspects with AccECN (TH_AE), after the internal ipfw "re-checksum required" was moved to use the TH_RES1 flag. No functional change. Reviewed By: tuexen, #transport, glebius Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43172
# 8717c306	21-Dec-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: allow userspace use of tcp header flags accessor functions Provide accessor functions to all 12 possible TCP header flags for userspace too. Reviewed By: zlei MFC after: 2 weeks Sponsored by: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D43152
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# 219a6ca9	21-Nov-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: uninline tcp_account_for_send() This allows to clear inclusion of "opt_kern_tls.h" from a system header. Reviewed by: rscheff, tuexen Differential Revision: https://reviews.freebsd.org/D42696
# 49a6fbe3	15-Nov-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	[tcp] add PRR 6937bis heuristic and retire prr_conservative sysctl Improve Proportional Rate Reduction (RFC6937) by using a heuristic, which automatically chooses between conservative CRB and more aggressive SSRB modes. Only when snd_una advances (a partial ACK), SSRB may be used. Also, that ACK must not have any indication of ongoing loss - using the addition of new holes into the scoreboard as proxy for such an event. MFC after: 4 weeks Reviewed By: #transport, kbowling, rrs Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28822
# 22dc8609	17-Oct-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: use signed IsLost() related accounting variables Coverity found that one safety check (kassert) was not functional, as possible incorrect subtractions during the accounting wouldn't show up as (invalid) negative values. Reported by: gallatin Reviewed By: cc, #transport Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D42180
# e2c6a6d2	09-Oct-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: include RFC6675 IsLost() in pipe calculation Add more accounting while processing SACK data, to keep track of when a packet is deemed lost using the RFC6675 guidance. Together with PRR (RFC6972) this allows a sender to retransmit presumed lost packets faster, and loss recovery to complete earlier. Reviewed By: cc, rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D39299
# 2ff63af9	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .h pattern Remove /^\s\+\s\$FreeBSD\$.$\n/
# cf32543f	27-Jul-2023	Michael Tuexen <tuexen@FreeBSD.org>	tcp: document that conditional fields in tcpcb should be at the end Reviewed by: rscheff, Peter Lei Sponsored by: Netflix, Inc.
# e4a873bf	19-Jul-2023	Michael Tuexen <tuexen@FreeBSD.org>	tcp: improve layout of struct tcpcb Put optional fields at the end to minimize run time problems in case CC modules are build from within its directory. Reviewed by: cc, gallatin, glebius, imp Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D41059
# d3152ab2	17-Jul-2023	Warner Losh <imp@FreeBSD.org>	tcbpcb: Always define t_osd Always define t_osd. congestion control modules access it unconditionally. This fixes the build. However, this is, at best, a temporary band-aide until the larger issues are sorted. Sponsored by: Netflix
# 43b117f8	06-Jun-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: make the maximum number of retransmissions tunable per VNET Both Windows (TcpMaxDataRetransmissions) and Linux (tcp_retries2) allow to restrict the maximum number of consecutive timer based retransmissions. Add that same capability on a per-VNet basis to FreeBSD. Reviewed By: cc, tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D40424
# 57a3a161	24-May-2023	Randall Stewart <rrs@FreeBSD.org>	tcp: request tracking is not http specific. This change is a name change only. TCP Request tracking can track sendfile and even non-sendfile requests. The names however in the current code use http, and they should not. The feature is not http specific. Lets change the name so they more properly reflect whats going on. This also fixes conflicts with http_req which caused application pain. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision:https://reviews.freebsd.org/D40229
# ec6d620b	19-May-2023	Randall Stewart <rrs@FreeBSD.org>	There are congestion control algorithms will that pull in srtt, and this can cause issues with rack. When using rack, cubic and htcp will grab the srtt, but they think it is in ticks. For rack it is in micro-seconds (which we should probably move all stacks to actually). This causes issues so instead lets make a new interface so that any CC module can pull the srtt in whatever granularity they want. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision:https://reviews.freebsd.org/D40146
# c3c20de3	25-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: move HPTS/LRO flags out of inpcb to tcpcb These flags are TCP specific. While here, make also several LRO internal functions to pass tcpcb pointer instead of inpcb one. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39698
# c2a69e84	25-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_hpts: move HPTS related fields from inpcb to tcpcb This makes inpcb lighter and allows future cache line optimizations of tcpcb. The reason why HPTS originally used inpcb is the compressed TIME-WAIT state (see 0d7445193ab), that used to free a tcpcb, while the associated connection is still on the HPTS ring. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39697
# 8aa2be69	20-Apr-2023	Cheng Cui <cc@FreeBSD.org>	Correct the value of macro TF2_TCP_ACCOUNTING. Summary: Make sure the values are in order. Reviewers: rscheff, tuexen, #transport! Approved by: rscheff, tuexen, glebius Subscribers: imp, melifaro, glebius Differential Revision: https://reviews.freebsd.org/D39716
# a540cdca	17-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_hpts: use queue(9) STAILQ for the input queue Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39574
# 66fbc19f	07-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: pass tcpcb in the tfb_tcp_ctloutput() method instead of inpcb Just matches rest of the KPI. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39435
# 35bc0bcc	07-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: reduce argument list to functions that pass a segment The socket argument is superfluous, as a tcpcb always has one and only one socket. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39434
# de4368dd	07-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: retire tfb_tcp_hpts_do_segment() Isn't in use anymore. Correct comments that mention it. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39433
# 945f9a7c	07-Apr-2023	Randall Stewart <rrs@FreeBSD.org>	tcp: misc cleanup of options for rack as well as socket option logging. Both BBR and Rack have the ability to log socket options, which is currently disabled. Rack has an experimental SaD (Sack Attack Detection) algorithm that should be made available. Also there is a t_maxpeak_rate that needs to be removed (its un-used). Reviewed by: tuexen, cc Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D39427
# 73ee5756	31-Mar-2023	Randall Stewart <rrs@FreeBSD.org>	Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features. So stack switching as always been a bit of a issue. We currently use a break before make setup which means that if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider for sure). Once all this lands I will then update rack to begin using all these new features. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D39210
# 69c7c811	16-Mar-2023	Randall Stewart <rrs@FreeBSD.org>	Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities. The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints we need to move to using the new inline functions. This adds them and moves rack to now use the tcp_tracepoints. Reviewed by: tuexen, gallatin Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D38831
# 2f201df1	20-Jul-2021	Alfonso <gfunni234@gmail.com>	Change hw_tls to a bool Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/512
# 624de4ec	22-Feb-2023	Michael Tuexen <tuexen@FreeBSD.org>	tcp: remove unused function prototype tcp_trace was implemented in tcp_debug.c, which was removed recently. Reviewed by: rscheff@, zlei@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D38712
# 76578d60	21-Feb-2023	Michael Tuexen <tuexen@FreeBSD.org>	bblog: improve timeout event handling Extend the BBLog RTO event to deal with all timers of the base stack. Also provide information about starting, stopping, and running off. The expiration of the retransmission timer is reported as it was done before. Reviewed by: rscheff@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D38710
# 6b802933	21-Feb-2023	Michael Tuexen <tuexen@FreeBSD.org>	tcp: rearrange enum and remove unused variable Rearrange the enum tt_which such that TT_REXMIT is 0. This allows an extension of the BBLog event RTO in a backwards compatible way. Remove tcptimers, which was only used in trpt, a utility removed from the source tree recently. Reviewed by: glebius@, guest-ccui@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D38547
# c0e4090e	08-Feb-2023	Andrew Gallatin <gallatin@FreeBSD.org>	ktls: Accurately track if ifnet ktls is enabled This allows us to avoid spurious calls to ktls_disable_ifnet() When we implemented ifnet kTLSe, we set a flag in the tx socket buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that now, or in the past, ifnet ktls was active on a socket. Later, I added code to switch ifnet ktls sessions to software in the case of lossy TCP connections that have a high retransmit rate. Because TCP was using SB_TLS_IFNET to know if it needed to do math to calculate the retransmit ratio and potentially call into ktls_disable_ifnet(), it was doing unneeded work long after a session was moved to software. This patch carefully tracks whether or not ifnet ktls is still enabled on a TCP connection. Because the inp is now embedded in the tcpcb, and because TCP is the most frequent accessor of this state, it made sense to move this from the socket buffer flags to the tcpcb. Because we now need reliable access to the tcbcb, we take a ref on the inp when creating a tx ktls session. While here, I noticed that rack/bbr were incorrectly implementing tfb_hwtls_change(), and applying the change to all pending sends, when it should apply only to future sends. This change reduces spurious calls to ktls_disable_ifnet() by 95% or so in a Netflix CDN environment. Reviewed by: markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D38380
# 9aff05bb	07-Feb-2023	John Baldwin <jhb@FreeBSD.org>	tcp_var.h: Fix spelling of independent in comment
# 18b83b62	26-Jan-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: reduce the size of t_rttupdated in tcpcb During tcp session start, various mechanisms need to track a few initial RTTs before becoming active. Prevent overflows of the corresponding tracking counter and reduce the size of tcpcb simultaneously. Reviewed By: #transport, tuexen, guest-ccui Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D21117
# 446ccdd0	07-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: use single locked callout per tcpcb for the TCP timers Use only one callout structure per tcpcb that is responsible for handling all five TCP timeouts. Use locked version of callout, of course. The callout function tcp_timer_enter() chooses soonest timer and executes it with lock held. Unless the timer reports that the tcpcb has been freed, the callout is rescheduled for next soonest timer, if there is any. With single callout per tcpcb on connection teardown we should be able to fully stop the callout and immediately free it, avoiding use of callout_async_drain(). There is one gotcha here: callout_stop() can actually touch our memory when a rare race condition happens. See comment above tcp_timer_stop(). Synchronous stop of the callout makes tcp_discardcb() the single entry point for tcpcb destructor, merging the tcp_freecb() to the end of the function. While here, also remove lots of lingering checks in the beginning of TCP timer functions. With a locked callout they are unnecessary. While here, clean unused parts of timer KPI for the pluggable TCP stacks. While here, remove TCPDEBUG from tcp_timer.c, as this allows for more simplification of TCP timers. The TCPDEBUG is scheduled for removal. Move the DTrace probes in timers to the beginning of a function, where a tcpcb is always existing. Discussed with: rrs, tuexen, rscheff (the TCP part of the diff) Reviewed by: hselasky, kib, mav (the callout part) Differential revision: https://reviews.freebsd.org/D37321
# 918fa422	07-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove tcp_timer_suspend() It was a temporary code added together with RACK to fight against TCP timer races.
# e68b3792	07-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: embed inpcb into tcpcb For the TCP protocol inpcb storage specify allocation size that would provide space to most of the data a TCP connection needs, embedding into struct tcpcb several structures, that previously were allocated separately. The most import one is the inpcb itself. With embedding we can provide strong guarantee that with a valid TCP inpcb the tcpcb is always valid and vice versa. Also we reduce number of allocs/frees per connection. The embedded inpcb is placed in the beginning of the struct tcpcb, since in_pcballoc() requires that. However, later we may want to move it around for cache line efficiency, and this can be done with a little effort. The new intotcpcb() macro is ready for such move. The congestion algorithm data, the TCP timers and osd(9) data are also embedded into tcpcb, and temprorary struct tcpcb_mem goes away. There was no extra allocation here, but we went through extra pointer every time we accessed this data. One interesting side effect is that now TCP data is allocated from SMR-protected zone. Potentially this allows the TCP stacks or other TCP related modules to utilize that for their own synchronization. Large part of the change was done with sed script: s/tp->ccv->/tp->t_ccv./g s/tp->ccv/\&tp->t_ccv/g s/tp->cc_algo/tp->t_cc/g s/tp->t_timers->tt_/tp->tt_/g s/CCV$ccv, osd$/\&CCV(ccv, t_osd)/g Dependency side effect is that code that needs to know struct tcpcb should also know struct inpcb, that added several <netinet/in_pcb.h>. Differential revision: https://reviews.freebsd.org/D37127
# bd4f9866	16-Nov-2022	Michael Tuexen <tuexen@FreeBSD.org>	tcp: remove unused t_rttbest No functional change intended. Reviewed by: rscheff@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D37401
# 1a70101a	10-Nov-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: account sent/received IP ECN markings independently Have tcpstats (netstat -s) differentiate between received and sent ECN-marked packets. Also account for IP ECN bits (on TCP packets) even when the tcp session has not negotiated ECN support. Event: IETF 115 Hackathon Reviewed By: glebius, tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D37314
# 8840ae22	08-Nov-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: don't store VNET in every tcpcb, take it from the inpcbinfo Reviewed by: rscheff Differential revision: https://reviews.freebsd.org/D37125
# 9eb0e832	08-Nov-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: provide macros to access inpcb and socket from a tcpcb There should be no functional changes with this commit. Reviewed by: rscheff Differential revision: https://reviews.freebsd.org/D37123
# dc9daa04	08-Nov-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: allow packets to be marked as ECT1 instead of ECT0 This adds the capability for a modular congestion control to select which variant of ECN-capable-transport it wants to use when sending out elegible segments. As an initial CC to utilize this, DCTCP was selected. Event: IETF 115 Hackathon Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D24869
# 37bf391d	06-Nov-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: make tcp_packets_this_ack() only visible in kernel scope
# b1258b76	06-Nov-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: add conservative d.cep accounting algorithm Accurate ECN asks to conservatively estimate, when the ACE counter may have wrapped due to a single ACK covering a larger number of segments. This is described in Annex A.2 of the accurate-ecn draft. Event: IETF 115 Hackathon Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D37281
# c348e880	31-Oct-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: make tcp_handle_wakeup() static and robust It is called only from tcp_input() and always has valid parameter. Reviewed by: rscheff, tuexen Differential revision: https://reviews.freebsd.org/D37115
# 83c1ec92	20-Oct-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: ECN preparations for ECN++, AccECN (tcp_respond) tcp_respond is another function to build a tcp control packet quickly. With ECN++ and AccECN, both the IP ECN header, and the TCP ECN flags are supposed to reflect the correct state. Also ensure that on receiving multiple ECN SYN-ACKs, the responses triggered will reflect the latest state. Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D36973
# c3738466	18-Oct-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: style the struct tcpcb definition - Use C99 types uintXX_t instead of u_intXX_t. - Try to make space/tab usage a little bit more consistent. - Shorten comments to fit into 80 chars. Not a functional change, just making future changes easier to read.
# 0d744519	06-Oct-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove tcptw, the compressed timewait state structure The memory savings the tcptw brought back in 2003 (see 340c35de6a2) no longer justify the complexity required to maintain it. For longer explanation please check out the email [1]. Surpisingly through almost 20 years the TCP stack functionality of handling the TIME_WAIT state with a normal tcpcb did not bitrot. The existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state, which is confirmed by the packetdrill tcp-testsuite [2]. This change just removes tcptw and leaves INP_TIMEWAIT. The flag will be removed in a separate commit. This makes it easier to review and possibly debug the changes. [1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html [2] https://github.com/freebsd-net/tcp-testsuite Differential revision: https://reviews.freebsd.org/D36398
# cd84e78f	04-Oct-2022	Randall Stewart <rrs@FreeBSD.org>	tcp idle reduce does not work for a server. TCP has an idle-reduce feature that allows a connection to reduce its cwnd after it has been idle more than an RTT. This feature only works for a sending side connection. It does this by at output checking the idle time (t_rcvtime vs ticks) to see if its more than the RTO timeout. The problem comes if you are a web server. You get a request and then send out all the data.. then go idle. The next time you would send is in response to a request from the peer asking for more data. But the thing is you updated t_rcvtime when the request came in so you never reduce. The fix is to do the idle reduce check also on inbound. Reviewed by: tuexen, rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D36721
# 775e20c1	03-Oct-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: make tcp_drop_syn_sent() static
# 43d39ca7	03-Oct-2022	Gleb Smirnoff <glebius@FreeBSD.org>	netinet*: de-void control input IP protocol methods After decoupling of protosw(9) and IP wire protocols in 78b1fc05b205 for IPv4 we got vector ip_ctlprotox[] that is executed only and only from icmp_input() and respectively for IPv6 we got ip6_ctlprotox[] executed only and only from icmp6_input(). This allows to use protocol specific argument types in these methods instead of struct sockaddr and void. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36727
# 08af8aac	27-Sep-2022	Randall Stewart <rrs@FreeBSD.org>	Tcp progress timeout Rack has had the ability to timeout connections that just sit idle automatically. This feature of course is off by default and requires the user set it on (though the socket option has been missing in tcp_usrreq.c). Lets get the progress timeout fully supported in the base stack as well as rack. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D36716
# e5049a17	26-Sep-2022	Randall Stewart <rrs@FreeBSD.org>	TCP rack does not work properly with cubic. Right now if you use rack with cubic (the new default cc) you will have improper results. This is because rack uses different variables than the base stack (or bbr) and thus tcp_compute_pipe() always returns so that cubic will choose a 30% backoff not the 50% backoff it should when it is newreno compatibility mode. The fix is to allow a stack (rack) to override its own compute_pipe. Reviewed by: tuexen, rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D36711
# 493105c2	21-Sep-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: fix simultaneous open and refine e80062a2d43 - The soisconnected() call on transition from SYN_RCVD to ESTABLISHED is also necessary for a half-synchronized connection. Fix that just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED. - Provide a comment that explains at what conditions the call to soisconnected() is necessary. - Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN. - Extend the change to the BBR and RACK stacks. Note: the interaction between the accept_filter(9) and the socket layer is not fully consistent, yet. For most accept filters this call to soisconnected() will not move the connection from the incomplete queue to the complete. The move would happen only when the filter has received the desired data, and soisconnected() would be called once again from sorwakeup(). Ideally, we should mark socket as connected only there, and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the simultaneous open case. However, this doesn't yet work. Reviewed by: rscheff, tuexen, rrs Differential revision: https://reviews.freebsd.org/D36641
# 0c7f3ae8	19-Sep-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcpcb: fix tabulation count in i4012ef7754c and abbreviate "packets" This lines up comments to the rest of the file. Abbreviation helps to fit in to 80 char terminal. Not a functional change.
# e80062a2	08-Sep-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: avoid call to soisconnected() on transition to ESTABLISHED This call existed since pre-FreeBSD times, and it is hard to understand why it was there in the first place. After 6f3caa6d815 it definitely became necessary always and commit message from f1ee30ccd60 confirms that. Now that 6f3caa6d815 is effectively backed out by 07285bb4c22, the call appears to be useful only for sockets that landed on the incomplete queue, e.g. sockets that have accept_filter(9) enabled on them. Provide a new TCP flag to mark connections that are known to be on the incomplete queue, and call soisconnected() only for those connections. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D36488
# 4012ef77	31-Aug-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Functional implementation of Accurate ECN The AccECN handshake and TCP header flags are supported, no support yet for the AccECN option. This minimalistic implementation is sufficient to support DCTCP while dramatically cutting the number of ACKs, and provide ECN response from the receiver to the CC modules. Reviewed By: #transport, #manpages, rrs, pauamma Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D21011
# e7d02be1	17-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protosw: refactor protosw and domain static declaration and load o Assert that every protosw has pr_attach. Now this structure is only for socket protocols declarations and nothing else. o Merge struct pr_usrreqs into struct protosw. This was suggested in 1996 by wollman@ (see 7b187005d18ef), and later reiterated in 2006 by rwatson@ (see 6fbb9cf860dcd). o Make struct domain hold a variable sized array of protosw pointers. For most protocols these pointers are initialized statically. Those domains that may have loadable protocols have spacers. IPv4 and IPv6 have 8 spacers each (andre@ dff3237ee54ea). o For inetsw and inet6sw leave a comment noting that many protosw entries very likely are dead code. o Refactor pf_proto_[un]register() into protosw_[un]register(). o Isolate pr_*_notsupp() methods into uipc_domain.c Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36232
# 81a34d37	17-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protosw: retire pr_drain and use EVENTHANDLER(9) directly The method was called for two different conditions: 1) the VM layer is low on pages or 2) one of UMA zones of mbuf allocator exhausted. This change 2) into a new event handler, but all affected network subsystems modified to subscribe to both, so this change shall not bring functional changes under different low memory situations. There were three subsystems still using pr_drain: TCP, SCTP and frag6. The latter had its protosw entry for the only reason to register its pr_drain method. Reviewed by: tuexen, melifaro Differential revision: https://reviews.freebsd.org/D36164
# 6c452841	17-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: use callout(9) directly instead of pr_slowtimo Modern TCP stacks uses multiple callouts per tcpcb, and a global callout is ancient artifact. However it is still used to garbage collect compressed timewait entries. Reviewed by: melifaro, tuexen Differential revision: https://reviews.freebsd.org/D36159
# 74703901	04-Jul-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: use a TCP flag to check if connection has been close(2)d The flag SS_NOFDREF is a private flag of the socket layer. It also is supposed to be read with SOCK_LOCK(), which we don't own here. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D35663
# 700a395c	13-Apr-2022	John Baldwin <jhb@FreeBSD.org>	tcp_log_vain/addrs: Use a const pointer for the IPv4 header. The pointer to the IPv6 header was already const.
# ea9017fb	21-Feb-2022	Randall Stewart <rrs@FreeBSD.org>	tcp: Congestion control move to using reference counting. In the transport call on 12/3 Gleb asked to move the CC modules towards using reference counting to prevent folks from unloading a module in use. It was also agreed that Michael would do a user space utility like tcp_drop that could be used to move all connections that are using a specific CC to some other CC. This is the half I committed to doing, making it so that we maintain a refcount on a cc module every time a pcb refers to it and decrementing that every time a pcb no longer uses a cc module. This also helps us simplify the whole unloading process by getting rid of tcp_ccunload() which munged through all the tcb's. Instead we mark a module as being removed and prevent further references to it. We also make sure that if a module is marked as being removed it cannot be made as the default and also the opposite of that, if its a default it fails and does not mark it as being removed. Reviewed by: Michael Tuexen, Gleb Smirnoff Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D33249
# 0c2832ee	15-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Restore 6 tcps padding entries in HEAD The padding in CURRENT shall not shrink. It is designed that in CURRENT at always stays the same, and then when a new stable is branched, it inherits 6 pointer placeholders that can be used withing this stable/X lifetime to extend the structure. Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34269
# 3f169c54	09-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Add/update AccECN related statistics and numbers Reserve couters in the tcps struct in preparation for AccECN, extend the debugging output for TF2 flags, optimize the syncache flags from individual bits to a codepoint for the specifc ECN handshake. This is in preparation of AccECN. No functional chance except for extended debug output capabilities. Reviewed By: #transport, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34161
# fd7daa72	08-Feb-2022	Michael Tuexen <tuexen@FreeBSD.org>	tcp: make tcp_ctloutput_set() non-static tcp_ctloutput_set() will be used via the sysctl interface in a upcoming command line tool tcpsso. Reviewed by: glebius, rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D34164
# 1ebf4607	03-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Access all 12 TCP header flags via inline function In order to consistently provide access to all (including reserved) TCP header flag bits, use an accessor function tcp_get_flags and tcp_set_flags. Also expand any flag variable from uint8_t / char to uint16_t. Reviewed By: hselasky, tuexen, glebius, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34130
# 3b3c08c1	02-Feb-2022	Michael Tuexen <tuexen@FreeBSD.org>	tcp: cleanup functions related to socket option handling Consistently only pass the inp and the sopt around. Don't pass the so around, since in a upcoming commit tcp_ctloutput_set() will be called from a context different from setsockopt(). Also expect the inp to be locked when calling tcp_ctloutput_[gs]et(), this is also required for the upcoming use by tcpsso, a command line tool to set socket options. Reviewed by: glebius, rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D34151
# 68e623c3	27-Jan-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Rewind erraneous RTO only while performing RTO retransmissions Under rare circumstances, a spurious retranmission is incorrectly detected and rewound, messing up various tcpcb values, which can lead to a panic when SACK is in use. Reviewed By: tuexen, chengc_netapp.com, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D33979
# 89128ff3	03-Jan-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protocols: init with standard SYSINIT(9) or VNET_SYSINIT The historical BSD network stack loop that rolls over domains and over protocols has no advantages over more modern SYSINIT(9). While doing the sweep, split global and per-VNET initializers. Getting rid of pr_init allows to achieve several things: o Get rid of ifdef's that protect against double foo_init() when both INET and INET6 are compiled in. o Isolate initializers statically to the module they init. o Makes code easier to understand and maintain. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D33537
# f64dc2ab	26-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: TCP output method can request tcp_drop The advanced TCP stacks (bbr, rack) may decide to drop a TCP connection when they do output on it. The default stack never does this, thus existing framework expects tcp_output() always to return locked and valid tcpcb. Provide KPI extension to satisfy demands of advanced stacks. If the output method returns negative error code, it means that caller must call tcp_drop(). In tcp_var() provide three inline methods to call tcp_output(): - tcp_output() is a drop-in replacement for the default stack, so that default stack can continue using it internally without modifications. For advanced stacks it would perform tcp_drop() and unlock and report that with negative error code. - tcp_output_unlock() handles the negative code and always converts it to positive and always unlocks. - tcp_output_nodrop() just calls the method and leaves the responsibility to drop on the caller. Sweep over the advanced stacks and use new KPI instead of using HPTS delayed drop queue for that. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33370
# 5b08b46a	26-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: welcome back tcp_output() as the right way to run output on tcpcb. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33365
# 2a28b045	20-Dec-2021	Robert Wing <rew@FreeBSD.org>	tcp_twrespond: send signed segment when connection is TCP-MD5 When a connection is established to use TCP-MD5, tcp_twrespond() doesn't respond with a signed segment. This results in the host performing the active close to remain in a TIME_WAIT state and the other host in the LAST_ACK state. Fix this by sending a signed segment when the connection is established to use TCP-MD5. Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D33490
# 71d2d5ad	19-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcptw: count how many times a tcptw was actually useful This will allow a sysadmin to lower net.inet.tcp.msl and see how long tcptw are actually useful.
# cb377263	19-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcptw: remove unused fields The structure goes away anyway, but it would be interesting to know how much memory we used to save with it. So for the record, structure size with this revision is 64 bytes.
# db0ac6de	02-Dec-2021	Cy Schubert <cy@FreeBSD.org>	Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816" This reverts commit 266f97b5e9a7958e365e78288616a459b40d924a, reversing changes made to a10253cffea84c0c980a36ba6776b00ed96c3e3b. A mismerge of a merge to catch up to main resulted in files being committed which should not have been.
# de2d4784	02-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	SMR protection for inpcbs With introduction of epoch(9) synchronization to network stack the inpcb database became protected by the network epoch together with static network data (interfaces, addresses, etc). However, inpcb aren't static in nature, they are created and destroyed all the time, which creates some traffic on the epoch(9) garbage collector. Fairly new feature of uma(9) - Safe Memory Reclamation allows to safely free memory in page-sized batches, with virtually zero overhead compared to uma_zfree(). However, unlike epoch(9), it puts stricter requirement on the access to the protected memory, needing the critical(9) section to access it. Details: - The database is already build on CK lists, thanks to epoch(9). - For write access nothing is changed. - For a lookup in the database SMR section is now required. Once the desired inpcb is found we need to transition from SMR section to r/w lock on the inpcb itself, with a check that inpcb isn't yet freed. This requires some compexity, since SMR section itself is a critical(9) section. The complexity is hidden from KPI users in inp_smr_lock(). - For a inpcb list traversal (a pcblist sysctl, or broadcast notification) also a new KPI is provided, that hides internals of the database - inp_next(struct inp_iterator *). Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33022
# ff945008	18-Nov-2021	Gleb Smirnoff <glebius@FreeBSD.org>	Add tcp_freecb() - single place to free tcpcb. Until this change there were two places where we would free tcpcb - tcp_discardcb() in case if all timers are drained and tcp_timer_discard() otherwise. They were pretty much copy-n-paste, except that in the default case we would run tcp_hc_update(). Merge this into single function tcp_freecb() and move new short version of tcp_timer_discard() to tcp_timer.c and make it static. Reviewed by: rrs, hselasky Differential revision: https://reviews.freebsd.org/D32965
# f581a26e	25-Oct-2021	Gleb Smirnoff <glebius@FreeBSD.org>	Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP. Pass control for IP/IP6 level options from generic tcp_ctloutput_set() down to per-stack ctloutput. Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput(). Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
# e2833083	25-Oct-2021	Peter Lei <peterlei@netflix.com>	tcp: socket option to get stack alias name TCP stack sysctl nodes are currently inserted using the stack name alias. Allow the user to get the current stack's alias to allow for programatic sysctl access. Obtained from: Netflix
# a36230f7	01-Oct-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Make dsack stats available in netstat and also make sure its aware of TLP's. DSACK accounting has been for quite some time under a NETFLIX_STATS ifdef. Statistics on DSACKs however are very useful in figuring out how much bad retransmissions you are doing. This is further complicated, however, by stacks that do TLP. A TLP when discovering a lost ack in the reverse path will cause the generation of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well as the more traditional dsack-bytes and dsack-packets. These will now all display in netstat -p tcp -s. This also updates all stacks that are currently built to keep track of these stats. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32158
# e3e7d953	15-Sep-2021	Hans Petter Selasky <hselasky@FreeBSD.org>	tcp: Avoid division by zero when KERN_TLS is enabled in tcp_account_for_send(). If the "len" variable is non-zero, we can assume that the sum of "tp->t_snd_rxt_bytes + tp->t_sndbytes" is also non-zero. It is also assumed that the 64-bit byte counters will never wrap around. Differential Revision: https://reviews.freebsd.org/D31959 Reviewed by: gallatin, rrs and tuexen Found by: "I told you so", also called hselasky MFC after: 1 week Sponsored by: NVIDIA Networking
# 739de953	05-Aug-2021	Andrew Gallatin <gallatin@FreeBSD.org>	ktls: Move KERN_TLS ifdef to tcp_var.h This allows us to remove stubs in ktls.h and allows us to sort the function prototypes. Reviewed by: jhb Sponsored by: Netflix
# ca1a7e10	12-Jul-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: TCP_LRO getting bad checksums and sending it in to TCP incorrectly. In reviewing tcp_lro.c we have a possibility that some drives may send a mbuf into LRO without making sure that the checksum passes. Some drivers actually are aware of this and do not call lro when the csum failed, others do not do this and thus could end up sending data up that we think has a checksum passing when it does not. This change will fix that situation by properly verifying that the mbuf has the correct markings (CSUM VALID bits as well as csum in mbuf header is set to 0xffff). Reviewed by: tuexen, hselasky, gallatin Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31155
# 28d0a740	06-Jul-2021	Andrew Gallatin <gallatin@FreeBSD.org>	ktls: auto-disable ifnet (inline hw) kTLS Ifnet (inline) hw kTLS NICs typically keep state within a TLS record, so that when transmitting in-order, they can continue encryption on each segment sent without DMA'ing extra state from the host. This breaks down when transmits are out of order (eg, TCP retransmits). In this case, the NIC must re-DMA the entire TLS record up to and including the segment being retransmitted. This means that when re-transmitting the last 1448 byte segment of a TLS record, the NIC will have to re-DMA the entire 16KB TLS record. This can lead to the NIC running out of PCIe bus bandwidth well before it saturates the network link if a lot of TCP connections have a high retransmoit rate. This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct), where TCP connections with higher retransmit rate will be switched to SW kTLS so as to conserve PCIe bandwidth. Reviewed by: hselasky, markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30908
# 9e4d9e4c	25-Jun-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Preparation for allowing hardware TLS to be able to kick a tcp connection that is retransmitting too much out of hardware and back to software. Hardware TLS is now supported in some interface cards and it works well. Except that when we have connections that retransmit a lot we get into trouble with all the retransmits. This prep step makes way for change that Drew will be making so that we can "kick out" a session from hardware TLS. Reviewed by: mtuexen, gallatin Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30895
# 67e89281	10-Jun-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Mbuf leak while holding a socket buffer lock. When running at NF the current Rack and BBR changes with the recent commits from Richard that cause the socket buffer lock to be held over the ip_output() call and then finally culminating in a call to tcp_handle_wakeup() we get a lot of leaked mbufs. I don't think that this leak is actually caused by holding the lock or what Richard has done, but is exposing some other bug that has probably been lying dormant for a long time. I will continue to look (using his changes) at what is going on to try to root cause out the issue. In the meantime I can't leave the leaks out for everyone else. So this commit will revert all of Richards changes and move both Rack and BBR back to just doing the old sorwakeup_locked() calls after messing with the so_rcv buffer. We may want to look at adding back in Richards changes after I have pinpointed the root cause of the mbuf leak and fixed it. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30704
# 032bf749	21-May-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	[tcp] Keep socket buffer locked until upcall r367492 would unlock the socket buffer before eventually calling the upcall. This leads to problematic interaction with NFS kernel server/client components (MP threads) accessing the socket buffer with potentially not correctly updated state. Reported by: rmacklem Reviewed By: tuexen, #transport Tested by: rmacklem, otis MFC after: 2 weeks Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29690
# 0471a8c7	10-May-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: SACK Lost Retransmission Detection (LRD) Recover from excessive losses without reverting to a retransmission timeout (RTO). Disabled by default, enable with sysctl net.inet.tcp.do_lrd=1 Reviewed By: #transport, rrs, tuexen, #manpages Sponsored by: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D28931
# 5d8fd932	06-May-2021	Randall Stewart <rrs@FreeBSD.org>	This brings into sync FreeBSD with the netflix versions of rack and bbr. This fixes several breakages (panics) since the tcp_lro code was committed that have been reported. Quite a few new features are now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the largest). There is also support for ack-war prevention. Documents comming soon on rack.. Sponsored by: Netflix Reviewed by: rscheff, mtuexen Differential Revision: https://reviews.freebsd.org/D30036
# 9ca874cf	30-Mar-2021	Hans Petter Selasky <hselasky@FreeBSD.org>	Add TCP LRO support for VLAN and VxLAN. This change makes the TCP LRO code more generic and flexible with regards to supporting multiple different TCP encapsulation protocols and in general lays the ground for broader TCP LRO support. The main job of the TCP LRO code is to merge TCP packets for the same flow, to reduce the number of calls to upper layers. This reduces CPU and increases performance, due to being able to send larger TSO offloaded data chunks at a time. Basically the TCP LRO makes it possible to avoid per-packet interaction by the host CPU. Because the current TCP LRO code was tightly bound and optimized for TCP/IP over ethernet only, several larger changes were needed. Also a minor bug was fixed in the flushing mechanism for inactive entries, where the expire time, "le->mtime" was not always properly set. To avoid having to re-run time consuming regression tests for every change, it was chosen to squash the following list of changes into a single commit: - Refactor parsing of all address information into the "lro_parser" structure. This easily allows to reuse parsing code for inner headers. - Speedup header data comparison. Don't compare field by field, but instead use an unsigned long array, where the fields get packed. - Refactor the IPv4/TCP/UDP checksum computations, so that they may be computed recursivly, only applying deltas as the result of updating payload data. - Make smaller inline functions doing one operation at a time instead of big functions having repeated code. - Refactor the TCP ACK compression code to only execute once per TCP LRO flush. This gives a minor performance improvement and keeps the code simple. - Use sbintime() for all time-keeping. This change also fixes flushing of inactive entries. - Try to shrink the size of the LRO entry, because it is frequently zeroed. - Removed unused TCP LRO macros. - Cleanup unused TCP LRO statistics counters while at it. - Try to use __predict_true() and predict_false() to optimise CPU branch predictions. Bump the __FreeBSD_version due to changing the "lro_ctrl" structure. Tested by: Netflix Reviewed by: rrs (transport) Differential Revision: https://reviews.freebsd.org/D29564 MFC after: 2 week Sponsored by: Mellanox Technologies // NVIDIA Networking
# 9e644c23	18-Apr-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: add support for TCP over UDP Adding support for TCP over UDP allows communication with TCP stacks which can be implemented in userspace without requiring special priviledges or specific support by the OS. This is joint work with rrs. Reviewed by: rrs Sponsored by: Netflix, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29469
# d1de2b05	17-Apr-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Rename rfc6675_pipe to sack.revised, and enable by default As full support of RFC6675 is in place, deprecating net.inet.tcp.rfc6675_pipe and enabling by default net.inet.tcp.sack.revised. Reviewed By: #transport, kbowling, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28702
# 90cca08e	08-Apr-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Prepare PRR to work with NewReno LossRecovery Add proper PRR vnet declarations for consistency. Also add pointer to tcpopt struct to tcp_do_prr_ack, in preparation for it to deal with non-SACK window reduction (after loss). No functional change. MFC after: 2 weeks Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29440
# eb3a59a8	25-Mar-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Refactor PRR code No functional change intended. MFC after: 2 weeks Reviewed By: #transport, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29411
# e5313869	05-Mar-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Add prr_out in preparation for PRR/nonSACK and LRD Reviewed By: #transport, kbowling MFC after: 3 days Sponsored By: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D29058
# 69a34e8d	26-Jan-2021	Randall Stewart <rrs@FreeBSD.org>	Update the LRO processing code so that we can support a further CPU enhancements for compressed acks. These are acks that are compressed into an mbuf. The transport has to be aware of how to process these, and an upcoming update to rack will do so. You need the rack changes to actually test and validate these since if the transport does not support mbuf compression, then the old code paths stay in place. We do in this commit take out the concept of logging if you don't have a lock (which was quite dangerous and was only for some early debugging but has been left in the code). Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D28374
# d2b3cedd	13-Jan-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: add sysctl to tolerate TCP segments missing timestamps When timestamp support has been negotiated, TCP segements received without a timestamp should be discarded. However, there are broken TCP implementations (for example, stacks used by Omniswitch 63xx and 64xx models), which send TCP segments without timestamps although they negotiated timestamp support. This patch adds a sysctl variable which tolerates such TCP segments and allows to interoperate with broken stacks. Reviewed by: jtl@, rscheff@ Differential Revision: https://reviews.freebsd.org/D28142 Sponsored by: Netflix, Inc. PR: 252449 MFC after: 1 week
# 0e1d7c25	04-Dec-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Add TCP feature Proportional Rate Reduction (PRR) - RFC6937 PRR improves loss recovery and avoids RTOs in a wide range of scenarios (ACK thinning) over regular SACK loss recovery. PRR is disabled by default, enable by net.inet.tcp.do_prr = 1. Performance may be impeded by token bucket rate policers at the bottleneck, where net.inet.tcp.do_prr_conservate = 1 should be enabled in addition. Submitted by: Aris Angelogiannopoulos Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D18892
# 4d0770f1	08-Nov-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Prevent premature SACK block transmission during loss recovery Under specific conditions, a window update can be sent with outdated SACK information. Some clients react to this by subsequently delaying loss recovery, making TCP perform very poorly. Reported by: chengc_netapp.com Reviewed by: rrs, jtl MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D24237
# ce398115	28-Oct-2020	John Baldwin <jhb@FreeBSD.org>	Save the current TCP pacing rate in t_pacing_rate. Reviewed by: gallatin, gnn Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26875
# 54321200	09-Oct-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Extend netstat to display TCP stack and detailed congestion state (2) Extend netstat to display TCP stack and detailed congestion state Adding the "-c" option used to show detailed per-connection congestion control state for TCP sessions. This is one summary patch, which adds the relevant variables into xtcpcb. As previous "spare" space is used, these changes are ABI compatible. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D26518
# 42d75607	13-Sep-2020	Michael Tuexen <tuexen@FreeBSD.org>	Export the name of the congestion control. This will be used by sockstat and netstat. Reviewed by: rscheff MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D26412
# f359d6eb	13-Aug-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Improve SACK support code for RFC6675 and PRR Adding proper accounting of sacked_bytes and (per-ACK) delivered data to the SACK scoreboard. This will allow more aspects of RFC6675 to be implemented as well as Proportional Rate Reduction (RFC6937). Prior to this change, the pipe calculation controlled with net.inet.tcp.rfc6675_pipe was also susceptible to incorrect results when more than 3 (or 4) holes in the sequence space were present, which can no longer all fit into a single ACK's SACK option. Reviewed by: kbowling, rgrimes (mentor) Approved by: rgrimes (mentor, blanket) MFC after: 3 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D18624
# 9dc7d8a2	24-Jun-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	TCP: make after-idle work for transactional sessions. The use of t_rcvtime as proxy for the last transmission fails for transactional IO, where the client requests data before the server can respond with a bulk transfer. Set aside a dedicated variable to actually track the last locally sent segment going forward. Reported by: rrs Reviewed by: rrs, tuexen (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D25016
# e854dd38	08-Jun-2020	Randall Stewart <rrs@FreeBSD.org>	An important statistic in determining if a server process (or client) is being delayed is to know the time to first byte in and time to first byte out. Currently we have no way to know these all we have is t_starttime. That (t_starttime) tells us what time the 3 way handshake completed. We don't know when the first request came in or how quickly we responded. Nor from a client perspective do we know how long from when we sent out the first byte before the server responded. This small change adds the ability to track the TTFB's. This will show up in BB logging which then can be pulled for later analysis. Note that currently the tracking is via the ticks variable of all three variables. This provides a very rough estimate (hz=1000 its 1ms). A follow-on set of work will be to change all three of these values into something with a much finer resolution (either microseconds or nanoseconds), though we may want to make the resolution configurable so that on lower powered machines we could still use the much cheaper ticks variable. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D24902
# d3b6c96b	04-May-2020	Randall Stewart <rrs@FreeBSD.org>	Adjust the fb to have a way to ask the underlying stack if it can support the PRUS option (OOB). And then have the new function call that to validate and give the correct error response if needed to the user (rack and bbr do not support obsoleted OOB data). Sponsoered by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D24574
# e570d231	27-Apr-2020	Randall Stewart <rrs@FreeBSD.org>	This change does a small prepratory step in getting the latest rack and bbr in from the NF repo. When those come in the OOB data handling will be fixed where Skyzaller crashes. Differential Revision: https://reviews.freebsd.org/D24575
# b89af8e1	14-Apr-2020	Michael Tuexen <tuexen@FreeBSD.org>	Improve the TCP blackhole detection. The principle is to reduce the MSS in two steps and try each candidate two times. However, if two candidates are the same (which is the case in TCP/IPv6), this candidate was tested four times. This patch ensures that each candidate actually reduced the MSS and is only tested 2 times. This reduces the time window of missclassifying a temporary outage as an MTU issue. Reviewed by: jtl MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24308
# 7ca6e296	12-Mar-2020	Michael Tuexen <tuexen@FreeBSD.org>	Use KMOD_TCPSTAT_INC instead of TCPSTAT_INC for RACK and BBR, since these are kernel modules. Also add a KMOD_TCPSTAT_ADD and use that instead of TCPSTAT_ADD. Reviewed by: jtl@, rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D23904
# a3574665	13-Feb-2020	Michael Tuexen <tuexen@FreeBSD.org>	sack_newdata and snd_recover hold the same value. Therefore, use only a single instance: use snd_recover also where sack_newdata was used. Submitted by: Richard Scheffenegger Differential Revision: https://reviews.freebsd.org/D18811
# 481be5de	12-Feb-2020	Randall Stewart <rrs@FreeBSD.org>	White space cleanup -- remove trailing tab's or spaces from any line. Sponsored by: Netflix Inc.
# 334fc582	08-Jan-2020	Bjoern A. Zeeb <bz@FreeBSD.org>	vnet: virtualise more network stack sysctls. Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are set in the netoptions startup script, which we would love to run for VNETs as well [1]. While virtualising the log_in_vain sysctls seems pointles at first for as long as the kernel message buffer is not virtualised, it at least allows an administrator to debug the base system or an individual jail if needed without turning the logging on for all jails running on a system. PR: 243193 [1] MFC after: 2 weeks
# 4ad24737	06-Jan-2020	Randall Stewart <rrs@FreeBSD.org>	This catches rack up in the recent changes to ECN and also commonizes the functions that both the freebsd and rack stack uses. Sponsored by:Netflix Inc Differential Revision: https://reviews.freebsd.org/D23052
# a9a08ece	05-Jan-2020	Randall Stewart <rrs@FreeBSD.org>	This change adds a small feature to the tcp logging code. Basically a connection can now have a separate tag added to the id. Obtained from: Lawrence Stewart Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D22866
# 1cf55767	17-Dec-2019	Randall Stewart <rrs@FreeBSD.org>	This commit is a bit of a re-arrange of deck chairs. It gets both rack and bbr ready for the completion of the STATs framework in FreeBSD. For now if you don't have both NF_stats and stats on it disables them. As soon as the rest of the stats framework lands we can remove that restriction and then just uses stats when defined. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D22479
# adc56f5a	02-Dec-2019	Edward Tomasz Napierala <trasz@FreeBSD.org>	Make use of the stats(3) framework in the TCP stack. This makes it possible to retrieve per-connection statistical information such as the receive window size, RTT, or goodput, using a newly added TCP_STATS getsockopt(3) option, and extract them using the stats_voistat_fetch(3) API. See the net/tcprtt port for an example consumer of this API. Compared to the existing TCP_INFO system, the main differences are that this mechanism is easy to extend without breaking ABI, and provides statistical information instead of raw "snapshots" of values at a given point in time. stats(3) is more generic and can be used in both userland and the kernel. Reviewed by: thj Tested by: thj Obtained from: Netflix Relnotes: yes Sponsored by: Klara Inc, Netflix Differential Revision: https://reviews.freebsd.org/D20655
# 3cf38784	01-Dec-2019	Michael Tuexen <tuexen@FreeBSD.org>	Move all ECN related flags from the flags to the flags2 field. This allows adding more ECN related flags in the future. No functional change intended. Submitted by: Richard Scheffenegger Reviewed by: rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22497
# 77aabfd9	01-Dec-2019	Michael Tuexen <tuexen@FreeBSD.org>	Make the TF_* flags easier readable by humans by adding leading zeroes to make them aligned. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22428
# b72e56e7	01-Dec-2019	Michael Tuexen <tuexen@FreeBSD.org>	This is an initial step in implementing the new congestion window validation as specified in RFC 7661. Submitted by: Richard Scheffenegger Reviewed by: rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D21798
# af9b9e0d	06-Sep-2019	Randall Stewart <rrs@FreeBSD.org>	This adds in the missing counter initialization which I had forgotten to bring over.. opps. Differential Revision: https://reviews.freebsd.org/D21127
# fe5dee73	02-Sep-2019	Michael Tuexen <tuexen@FreeBSD.org>	This patch improves the DSACK handling to conform with RFC 2883. The lowest SACK block is used when multiple Blocks would be elegible as DSACK blocks ACK blocks get reordered - while maintaining the ordering of SACK blocks not relevant in the DSACK context is maintained. Reviewed by: rrs@, tuexen@ Obtained from: Richard Scheffenegger MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21038
# b2e60773	26-Aug-2019	John Baldwin <jhb@FreeBSD.org>	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277
# d21036e0	23-Jul-2019	Michael Tuexen <tuexen@FreeBSD.org>	Add a sysctl variable ts_offset_per_conn to change the computation of the TCP TS offset from taking the IP addresses and the TCP port numbers into account to a version just taking only the IP addresses into account. This works around broken middleboxes or endpoints. The default is to keep the behaviour, which is also the behaviour recommended in RFC 7323. Reported by: devgs@ukr.net Reviewed by: rrs@ MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D20980
# e5926fd3	14-Jul-2019	Randall Stewart <rrs@FreeBSD.org>	This is the second in a number of patches needed to get BBRv1 into the tree. This fixes the DSACK bug but is also needed by BBR. We have yet to go two more one will be for the pacing code (tcp_ratelimit.c) and the second will be for the new updated LRO code that allows a transport to know the arrival times of packets and (tcp_lro.c). After that we should finally be able to get BBRv1 into head. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D20908
# 3b0b41e6	10-Jul-2019	Randall Stewart <rrs@FreeBSD.org>	This commit updates rack to what is basically being used at NF as well as sets in some of the groundwork for committing BBR. The hpts system is updated as well as some other needed utilities for the entrance of BBR. This is actually part 1 of 3 more needed commits which will finally complete with BBRv1 being added as a new tcp stack. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20834
# 7893235f	29-Mar-2019	Navdeep Parhar <np@FreeBSD.org>	tcp_autorcvbuf_inc was removed in r344433. Discussed with: tuexen@ Sponsored by: Chelsio Communications
# 7dc90a1d	25-Jan-2019	Michael Tuexen <tuexen@FreeBSD.org>	Fix a bug in the restart window computation of TCP New Reno When implementing support for IW10, an update in the computation of the restart window used after an idle phase was missed. To minimize code duplication, implement the logic in tcp_compute_initwnd() and call it. This fixes a bug in NewReno, which was not aware of IW10. Submitted by: Richard Scheffenegger Reviewed by: tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18940
# c28440db	19-Aug-2018	Randall Stewart <rrs@FreeBSD.org>	This change represents a substantial restructure of the way we reassembly inbound tcp segments. The old algorithm just blindly dropped in segments without coalescing. This meant that every segment could take up greater and greater room on the linked list of segments. This of course is now subject to a tighter limit (100) of segments which in a high BDP situation will cause us to be a lot more in-efficent as we drop segments beyond 100 entries that we receive. What this restructure does is cause the reassembly buffer to coalesce segments putting an emphasis on the two common cases (which avoid walking the list of segments) i.e. where we add to the back of the queue of segments and where we add to the front. We also have the reassembly buffer supporting a couple of debug options (black box logging as well as counters for code coverage). These are compiled out by default but can be added by uncommenting the defines. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16626
# 8e02b4e0	19-Aug-2018	Michael Tuexen <tuexen@FreeBSD.org>	Don't expose the uptime via the TCP timestamps. The TCP client side or the TCP server side when not using SYN-cookies used the uptime as the TCP timestamp value. This patch uses in all cases an offset, which is the result of a keyed hash function taking the source and destination addresses and port numbers into account. The keyed hash function is the same a used for the initial TSN. Reviewed by: rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16636
# f38b68ae	05-Jul-2018	Brooks Davis <brooks@FreeBSD.org>	Make struct xinpcb and friends word-size independent. Replace size_t members with ksize_t (uint64_t) and pointer members (never used as pointers in userspace, but instead as unique idenitifiers) with kvaddr_t (uint64_t). This makes the structs identical between 32-bit and 64-bit ABIs. On 64-bit bit systems, the ABI is maintained. On 32-bit systems, this is an ABI breaking change. The ABI of most of these structs was previously broken in r315662. This also imposes a small API change on userspace consumers who must handle kernel pointers becoming virtual addresses. PR: 228301 (exp-run by antoine) Reviewed by: jtl, kib, rwatson (various versions) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15386
# 6573d758	03-Jul-2018	Matt Macy <mmacy@FreeBSD.org>	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066
# 89e560f4	07-Jun-2018	Randall Stewart <rrs@FreeBSD.org>	This commit brings in a new refactored TCP stack called Rack. Rack includes the following features: - A different SACK processing scheme (the old sack structures are not used). - RACK (Recent acknowledgment) where counting dup-acks is no longer done instead time is used to knwo when to retransmit. (see the I-D) - TLP (Tail Loss Probe) where we will probe for tail-losses to attempt to try not to take a retransmit time-out. (see the I-D) - Burst mitigation using TCPHTPS - PRR (partial rate reduction) see the RFC. Once built into your kernel, you can select this stack by either socket option with the name of the stack is "rack" or by setting the global sysctl so the default is rack. Note that any connection that does not support SACK will be kicked back to the "default" base FreeBSD stack (currently known as "default"). To build this into your kernel you will need to enable in your kernel: makeoptions WITH_EXTRA_TCP_STACKS=1 options TCPHPTS Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15525
# 803a2305	26-Apr-2018	Randall Stewart <rrs@FreeBSD.org>	This change re-arranges the fields within the tcp-pcb so that they are more in order of cache line use as one passes through the tcp_input/output paths (non-errors most likely path). This helps speed up cache line optimization so that the tcp stack runs a bit more efficently. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15136
# 3ee9c3c4	19-Apr-2018	Randall Stewart <rrs@FreeBSD.org>	This commit brings in the TCP high precision timer system (tcp_hpts). It is the forerunner/foundational work of bringing in both Rack and BBR which use hpts for pacing out packets. The feature is optional and requires the TCPHPTS option to be enabled before the feature will be active. TCP modules that use it must assure that the base component is compile in the kernel in which they are loaded. MFC after: Never Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15020
# f8979519	10-Apr-2018	Jonathan T. Looney <jtl@FreeBSD.org>	Add missing header change from r332381. Sponsored by: Netflix, Inc. Pointy hat: jtl
# 2529f56e	22-Mar-2018	Jonathan T. Looney <jtl@FreeBSD.org>	Add the "TCP Blackbox Recorder" which we discussed at the developer summits at BSDCan and BSDCam in 2017. The TCP Blackbox Recorder allows you to capture events on a TCP connection in a ring buffer. It stores metadata with the event. It optionally stores the TCP header associated with an event (if the event is associated with a packet) and also optionally stores information on the sockets. It supports setting a log ID on a TCP connection and using this to correlate multiple connections that share a common log ID. You can log connections in different modes. If you are doing a coordinated test with a particular connection, you may tell the system to put it in mode 4 (continuous dump). Or, if you just want to monitor for errors, you can put it in mode 1 (ring buffer) and dump all the ring buffers associated with the connection ID when we receive an error signal for that connection ID. You can set a default mode that will be applied to a particular ratio of incoming connections. You can also manually set a mode using a socket option. This commit includes only basic probes. rrs@ has added quite an abundance of probes in his TCP development work. He plans to commit those soon. There are user-space programs which we plan to commit as ports. These read the data from the log device and output pcapng files, and then let you analyze the data (and metadata) in the pcapng files. Reviewed by: gnn (previous version) Obtained from: Netflix, Inc. Relnotes: yes Differential Revision: https://reviews.freebsd.org/D11085
# 18a75309	25-Feb-2018	Patrick Kelsey <pkelsey@FreeBSD.org>	Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option. The conditional compilation support is now centralized in tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum theoretical code/data footprint when TCP_RFC7413 is disabled, but nearly all the TFO code should wind up being removed by the optimizer, the additional footprint in the syncache entries is a single pointer, and the additional overhead in the tcpcb is at the end of the structure. This enables the TCP_RFC7413 kernel option by default in amd64 and arm64 GENERIC. Reviewed by: hiren MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14048
# c560df6f	25-Feb-2018	Patrick Kelsey <pkelsey@FreeBSD.org>	This is an implementation of the client side of TCP Fast Open (TFO) [RFC7413]. It also includes a pre-shared key mode of operation in which the server requires the client to be in possession of a shared secret in order to successfully open TFO connections with that server. The names of some existing fastopen sysctls have changed (e.g., net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable). Reviewed by: tuexen MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14047
# 66492fea	07-Dec-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Separate out send buffer autoscaling code into function, so that alternative TCP stacks may reuse it instead of pasting. Obtained from: Netflix
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# 3bdf4c42	11-Oct-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Declare more TCP globals in tcp_var.h, so that alternative TCP stacks can use them. Gather all TCP tunables in tcp_var.h in one place and alphabetically sort them, to ease maintainance of the list. Don't copy and paste declarations in tcp_stacks/fastpath.c.
# e5cccc35	12-Sep-2017	Michael Tuexen <tuexen@FreeBSD.org>	Add support to print the TCP stack being used. Sponsored by: Netflix, Inc.
# 32a04bb8	25-Aug-2017	Sean Bruno <sbruno@FreeBSD.org>	Use counter(9) for PLPMTUD counters. Remove unused PLPMTUD sysctl counters. Bump UPDATING and FreeBSD Version to indicate a rebuild is required. Submitted by: kevin.bowling@kev009.com Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12003
# dc6a41b9	08-Jun-2017	Jonathan T. Looney <jtl@FreeBSD.org>	Add the infrastructure to support loading multiple versions of TCP stack modules. It adds support for mangling symbols exported by a module by prepending a string to them. (This avoids overlapping symbols in the kernel linker.) It allows the use of a macro as the module name in the DECLARE_MACRO() and MACRO_VERSION() macros. It allows the code to register stack aliases (e.g. both a generic name ["default"] and version-specific name ["default_10_3p1"]). With these changes, it is trivial to compile TCP stack modules with the name defined in the Makefile and to load multiple versions of the same stack simultaneously. This functionality can be used to enable side-by-side testing of an old and new version of the same TCP stack. It also could support upgrading the TCP stack without a reboot. Reviewed by: gnn, sjg (makefiles only) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D11086
# e44c1887	10-Apr-2017	Steven Hartland <smh@FreeBSD.org>	Use estimated RTT for receive buffer auto resizing instead of timestamps Switched from using timestamps to RTT estimates when performing TCP receive buffer auto resizing, as not all hosts support / enable TCP timestamps. Disabled reset of receive buffer auto scaling when not in bulk receive mode, which gives an extra 20% performance increase. Also extracted auto resizing to a common method shared between standard and fastpath modules. With this AWS S3 downloads at ~17ms latency on a 1Gbps connection jump from ~3MB/s to ~100MB/s using the default settings. Reviewed by: lstewart, gnn MFC after: 2 weeks Relnotes: Yes Sponsored by: Multiplay Differential Revision: https://reviews.freebsd.org/D9668
# cc65eb4e	21-Mar-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Hide struct inpcb, struct tcpcb from the userland. This is a painful change, but it is needed. On the one hand, we avoid modifying them, and this slows down some ideas, on the other hand we still eventually modify them and tools like netstat(1) never work on next version of FreeBSD. We maintain a ton of spares in them, and we already got some ifdef hell at the end of tcpcb. Details: - Hide struct inpcb, struct tcpcb under _KERNEL \|\| _WANT_FOO. - Make struct xinpcb, struct xtcpcb pure API structures, not including kernel structures inpcb and tcpcb inside. Export into these structures the fields from inpcb and tcpcb that are known to be used, and put there a ton of spare space. - Make kernel and userland utilities compilable after these changes. - Bump __FreeBSD_version. Reviewed by: rrs, gnn Differential Revision: D10018
# fbbd9655	28-Feb-2017	Warner Losh <imp@FreeBSD.org>	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96
# cfff3743	10-Feb-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Move tcp_fields_to_net() static inline into tcp_var.h, just below its friend tcp_fields_to_host(). There is third party code that also uses this inline. Reviewed by: ae
# fcf59617	06-Feb-2017	Andrey V. Elsukov <ae@FreeBSD.org>	Merge projects/ipsec into head/. Small summary ------------- o Almost all IPsec releated code was moved into sys/netipsec. o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel option IPSEC_SUPPORT added. It enables support for loading and unloading of ipsec.ko and tcpmd5.ko kernel modules. o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type support was removed. Added TCP/UDP checksum handling for inbound packets that were decapsulated by transport mode SAs. setkey(8) modified to show run-time NAT-T configuration of SA. o New network pseudo interface if_ipsec(4) added. For now it is build as part of ipsec.ko module (or with IPSEC kernel). It implements IPsec virtual tunnels to create route-based VPNs. o The network stack now invokes IPsec functions using special methods. The only one header file <netipsec/ipsec_support.h> should be included to declare all the needed things to work with IPsec. o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed. Now these protocols are handled directly via IPsec methods. o TCP_SIGNATURE support was reworked to be more close to RFC. o PF_KEY SADB was reworked: - now all security associations stored in the single SPI namespace, and all SAs MUST have unique SPI. - several hash tables added to speed up lookups in SADB. - SADB now uses rmlock to protect access, and concurrent threads can do SA lookups in the same time. - many PF_KEY message handlers were reworked to reflect changes in SADB. - SADB_UPDATE message was extended to support new PF_KEY headers: SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They can be used by IKE daemon to change SA addresses. o ipsecrequest and secpolicy structures were cardinally changed to avoid locking protection for ipsecrequest. Now we support only limited number (4) of bundled SAs, but they are supported for both INET and INET6. o INPCB security policy cache was introduced. Each PCB now caches used security policies to avoid SP lookup for each packet. o For inbound security policies added the mode, when the kernel does check for full history of applied IPsec transforms. o References counting rules for security policies and security associations were changed. The proper SA locking added into xform code. o xform code was also changed. Now it is possible to unregister xforms. tdb_xxx structures were changed and renamed to reflect changes in SADB/SPDB, and changed rules for locking and refcounting. Reviewed by: gnn, wblock Obtained from: Yandex LLC Relnotes: yes Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D9352
# 5e946c03	12-Jan-2017	Maxim Sobolev <sobomax@FreeBSD.org>	Fix slight type mismatch between so_options defined in sys/socketvar.h and tw_so_options defined here which is supposed to be a copy of the former (short vs u_short respectively). Switch tw_so_options to be "signed short" to match the type of the field it's inherited from.
# 68bd7ed1	12-Oct-2016	Jonathan T. Looney <jtl@FreeBSD.org>	The TFO server-side code contains some changes that are not conditioned on the TCP_RFC7413 kernel option. This change removes those few instructions from the packet processing path. While not strictly necessary, for the sake of consistency, I applied the new IS_FASTOPEN macro to all places in the packet processing path that used the (t_flags & TF_FASTOPEN) check. Reviewed by: hiren Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8219
# bd79708d	11-Oct-2016	Jonathan T. Looney <jtl@FreeBSD.org>	In the TCP stack, the hhook(9) framework provides hooks for kernel modules to add actions that run when a TCP frame is sent or received on a TCP session in the ESTABLISHED state. In the base tree, this functionality is only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd, and cc_vegas congestion control modules. Presently, we incur overhead to check for hooks each time a TCP frame is sent or received on an ESTABLISHED TCP session. This change adds a new compile-time option (TCP_HHOOK) to determine whether to include the hhook(9) framework for TCP. To retain backwards compatibility, I added the TCP_HHOOK option to every configuration file that already defined "options INET". (Therefore, this patch introduces no functional change. In order to see a functional difference, you need to compile a custom kernel without the TCP_HHOOK option.) This change will allow users to easily exclude this functionality from their kernel, should they wish to do so. Note that any users who use a custom kernel configuration and use one of the congestion control modules listed above will need to add the TCP_HHOOK option to their kernel configuration. Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8185
# 3ac12506	06-Oct-2016	Jonathan T. Looney <jtl@FreeBSD.org>	Remove "long" variables from the TCP stack (not including the modular congestion control framework). Reviewed by: gnn, lstewart (partial) Sponsored by: Juniper Networks, Netflix Differential Revision: (multiple) Tested by: Limelight, Netflix
# 55a429a6	06-Oct-2016	Jonathan T. Looney <jtl@FreeBSD.org>	Remove declaration of un-defined function tcp_seq_subtract(). Reviewed by: gnn MFC after: 1 week Sponsored by: Juniper Networks, Netflix Differential Revision: https://reviews.freebsd.org/D7055
# 4b7b743c	25-Aug-2016	Lawrence Stewart <lstewart@FreeBSD.org>	Pass the number of segments coalesced by LRO up the stack by repurposing the tso_segsz pkthdr field during RX processing, and use the information in TCP for more correct accounting and as a congestion control input. This is only a start, and an audit of other uses for the data is left as future work. Reviewed by: gallatin, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D7564
# 587d67c0	16-Aug-2016	Randall Stewart <rrs@FreeBSD.org>	Here we update the modular tcp to be able to switch to an alternate TCP stack in other then the closed state (pre-listen/connect). The idea is that if that is supported by the alternate stack, it is asked if its ok to switch. If it approves the "handoff" then we allow the switch to happen. Also the fini() function now gets a flag to tell if you are switching away or the tcb is destroyed. The init() call into the alternate stack is moved to the end so the tcb is more fully formed before the init transpires. Sponsored by: Netflix Inc. Differential Revision: D6790
# 3f58662d	01-Jun-2016	Bjoern A. Zeeb <bz@FreeBSD.org>	The pr_destroy field does not allow us to run the teardown code in a specific order. VNET_SYSUNINITs however are doing exactly that. Thus remove the VIMAGE conditional field from the domain(9) protosw structure and replace it with VNET_SYSUNINITs. This also allows us to change some order and to make the teardown functions file local static. Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use internally. Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g., for pfil consumers (firewalls), partially for this commit and for others to come. Reviewed by: gnn, tuexen (sctp), jhb (kernel.h) Obtained from: projects/vnet MFC after: 2 weeks X-MFC: do not remove pr_destroy Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D6652
# f59d975e	17-May-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Tiny refactor of r294869/r296881: use defines to mask the VNET() macro. Suggested by: bz
# 5105a92c	17-May-2016	Randall Stewart <rrs@FreeBSD.org>	This small change adopts the excellent suggestion for using named structures in the add of a new tcp-stack that came in late to me via email after the last commit. It also makes it so that a new stack may optionally get a callback during a retransmit timeout. This allows the new stack to clear specific state (think sack scoreboards or other such structures). Sponsored by: Netflix Inc. Differential Revision: http://reviews.freebsd.org/D6303
# e5ad6456	28-Apr-2016	Randall Stewart <rrs@FreeBSD.org>	This cleans up the timers code in TCP to start using the new async_drain functionality. This as been tested in NF as well as by Verisign. Still to do in here is to remove all the old flags. They are currently left being maintained but probably are no longer needed. Sponsored by: Netflix Inc. Differential Revision: http://reviews.freebsd.org/D5924
# 5d20f974	22-Mar-2016	Jonathan T. Looney <jtl@FreeBSD.org>	to_flags is currently a 64-bit integer; however, we only use 7 bits. Furthermore, there is no reason this needs to be a 64-bit integer for the forseeable future. Also, there is an inconsistency between to_flags and the mask in tcp_addoptions(). Before r195654, to_flags was a u_long and the mask in tcp_addoptions() was a u_int. r195654 changed to_flags to be a u_int64_t but left the mask in tcp_addoptions() as a u_int, meaning that these variables will only be the same width on platforms with 64-bit integers. Convert both to_flags and the mask in tcp_addoptions() to be explicitly 32-bit variables. This may save a few cycles on 32-bit platforms, and avoids unnecessarily mixing types. Differential Revision: https://reviews.freebsd.org/D5584 Reviewed by: hiren MFC after: 2 weeks Sponsored by: Juniper Networks
# bf840a17	14-Mar-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Redo r294869. The array of counters for TCP states doesn't belong to struct tcpstat, because the structure can be zeroed out by netstat(1) -z, and of course running connection counts shouldn't be touched. Place running connection counts into separate array, and provide separate read-only sysctl oid for it.
# 2f06d2ab	14-Mar-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Comment fix: statistics are not read-only.
# 57a78e3b	26-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Augment struct tcpstat with tcps_states[], which is used for book-keeping the amount of TCP connections by state. Provides a cheap way to get connection count without traversing the whole pcb list. Sponsored by: Netflix
# d17d4c6b	26-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Provide TCPSTAT_DEC() and TCPSTAT_FETCH() macros.
# 1f12da0e	22-Jan-2016	Bjoern A. Zeeb <bz@FreeBSD.org>	Just checkpoint the WIP in order to be able to make the tree update easier. Note: this is currently not in a usable state as certain teardown parts are not called and the DOMAIN rework is missing. More to come soon and find its way to head. Obtained from: P4 //depot/user/bz/vimage/... Sponsored by: The FreeBSD Foundation
# 0c39d38d	06-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Historically we have two fields in tcpcb to describe sender MSS: t_maxopd, and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned up after T/TCP removal. After all permutations over the years the result is that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if timestamps are in action, or is equal to t_maxopd otherwise. That's a very rough estimate of MSS reduced by options length. Throughout the code it was used in places, where preciseness was not important, like cwnd or ssthresh calculations. With this change: - t_maxopd goes away. - t_maxseg now stores MSS not adjusted by options. - new function tcp_maxseg() is provided, that calculates MSS reduced by options length. The functions gives a better estimate, since it takes into account SACK state as well. Reviewed by: jtl Differential Revision: https://reviews.freebsd.org/D3593
# 281a0fd4	24-Dec-2015	Patrick Kelsey <pkelsey@FreeBSD.org>	Implementation of server-side TCP Fast Open (TFO) [RFC7413]. TFO is disabled by default in the kernel build. See the top comment in sys/netinet/tcp_fastopen.c for implementation particulars. Reviewed by: gnn, jch, stas MFC after: 3 days Sponsored by: Verisign, Inc. Differential Revision: https://reviews.freebsd.org/D4350
# 55bceb1e	15-Dec-2015	Randall Stewart <rrs@FreeBSD.org>	First cut of the modularization of our TCP stack. Still to do is to clean up the timer handling using the async-drain. Other optimizations may be coming to go with this. Whats here will allow differnet tcp implementations (one included). Reviewed by: jtl, hiren, transports Sponsored by: Netflix Inc. Differential Revision: D4055
# 4d163382	10-Dec-2015	Hiren Panchasara <hiren@FreeBSD.org>	Clean up unused bandwidth entry in the TCP hostcache. Submitted by: Jason Wolfe (j at nitrology dot com) Reviewed by: rrs, hiren Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D4154
# b87170f2	09-Dec-2015	Hiren Panchasara <hiren@FreeBSD.org>	r290122 added 4 bytes and removed 8 in struct sackhint. Add a pad entry of 4 bytes to restore the size. Spotted by: rrs Reviewed by: rrs X-MFC with: r290122 Sponsored by: Limelight Networks
# 021eaf79	08-Dec-2015	Hiren Panchasara <hiren@FreeBSD.org>	One of the ways to detect loss is to count duplicate acks coming back from the other end till it reaches predetermined threshold which is 3 for us right now. Once that happens, we trigger fast-retransmit to do loss recovery. Main problem with the current implementation is that we don't honor SACK information well to detect whether an incoming ack is a dupack or not. RFC6675 has latest recommendations for that. According to it, dupack is a segment that arrives carrying a SACK block that identifies previously unknown information between snd_una and snd_max even if it carries new data, changes the advertised window, or moves the cumulative acknowledgment point. With the prevalence of Selective ACK (SACK) these days, improper handling can lead to delayed loss recovery. With the fix, new behavior looks like following: 0) th_ack < snd_una --> ignore Old acks are ignored. 1) th_ack == snd_una, !sack_changed --> ignore Acks with SACK enabled but without any new SACK info in them are ignored. 2) th_ack == snd_una, window == old_window --> increment Increment on a good dupack. 3) th_ack == snd_una, window != old_window, sack_changed --> increment When SACK enabled, it's okay to have advertized window changed if the ack has new SACK info. 4) th_ack > snd_una --> reset to 0 Reset to 0 when left edge moves. 5) th_ack > snd_una, sack_changed --> increment Increment if left edge moves but there is new SACK info. Here, sack_changed is the indicator that incoming ack has previously unknown SACK info in it. Note: This fix is not fully compliant to RFC6675. That may require a few changes to current implementation in order to keep per-sackhole dupack counter and change to the way we mark/handle sack holes. PR: 203663 Reviewed by: jtl MFC after: 3 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D4225
# 12eeb81f	28-Oct-2015	Hiren Panchasara <hiren@FreeBSD.org>	Calculate the correct amount of bytes that are in-flight for a connection as suggested by RFC 6675. Currently differnt places in the stack tries to guess this in suboptimal ways. The main problem is that current calculations don't take sacked bytes into account. Sacked bytes are the bytes receiver acked via SACK option. This is suboptimal because it assumes that network has more outstanding (unacked) bytes than the actual value and thus sends less data by setting congestion window lower than what's possible which in turn may cause slower recovery from losses. As an example, one of the current calculations looks something like this: snd_nxt - snd_fack + sackhint.sack_bytes_rexmit New proposal from RFC 6675 is: snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit which takes sacked bytes into account which is a new addition to the sackhint struct. Only thing we are missing from RFC 6675 is isLost() i.e. segment being considered lost and thus adjusting pipe based on that which makes this calculation a bit on conservative side. The approach is very simple. We already process each ack with sack info in tcp_sack_doack() and extract sack blocks/holes out of it. We'd now also track this new variable sacked_bytes which keeps track of total sacked bytes reported. One downside to this approach is that we may get incorrect count of sacked_bytes if the other end decides to drop sack info in the ack because of memory pressure or some other reasons. But in this (not very likely) case also the pipe calculation would be conservative which is okay as opposed to being aggressive in sending packets into the network. Next step is to use this more accurate pipe estimation to drive congestion window adjustments. In collaboration with: rrs Reviewed by: jason_eggnet dot com, rrs MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D3971
# 356c7958	27-Oct-2015	Hiren Panchasara <hiren@FreeBSD.org>	Add sysctl tunable net.inet.tcp.initcwnd_segments to specify initial congestion window in number of segments on fly. It is set to 10 segments by default. Remove net.inet.tcp.experimental.initcwnd10 which is now redundant. Also remove the parent node net.inet.tcp.experimental as it's not needed anymore and also because it was not well thought out. Differential Revision: https://reviews.freebsd.org/D3858 In collaboration with: lstewart Reviewed by: gnn (prev version), rwatson, allanjude, wblock (man page) MFC after: 2 weeks Relnotes: yes Sponsored by: Limelight Networks
# 86a996e6	13-Oct-2015	Hiren Panchasara <hiren@FreeBSD.org>	There are times when it would be really nice to have a record of the last few packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that. It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb. To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call). There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format. I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature. The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly. Differential Revision: D3100 Submitted by: Jonathan Looney <jlooney at juniper dot net> Reviewed by: gnn, hiren
# 1558cb24	26-Sep-2015	Alexander V. Chernikov <melifaro@FreeBSD.org>	Eliminate nd6_nud_hint() and its TCP bindings. Initially function was introduced in r53541 (KAME initial commit) to "provide hints from upper layer protocols that indicate a connection is making "forward progress"" (quote from RFC 2461 7.3.1 Reachability Confirmation). However, it was converted to do nothing (e.g. just return) in r122922 (tcp_hostcache implementation) back in 2003. Some defines were moved to tcp_var.h in r169541. Then, it was broken (for non-corner cases) by r186119 (L2<>L3 split) in 2008 (NULL ifp in nd6_lookup). So, right now this code is broken and has no "real" base users. Differential Revision: https://reviews.freebsd.org/D3699
# 24067db8	03-Sep-2015	Gleb Smirnoff <glebius@FreeBSD.org>	Make tcp_mtudisc() static and void. No functional changes. Sponsored by: Nginx, Inc.
# 03041aaa	30-Jul-2015	Hiren Panchasara <hiren@FreeBSD.org>	Update snd_una description to make it more readable. Differential Revision: https://reviews.freebsd.org/D3179 Reviewed by: gnn Sponsored by: Limelight Networks
# 4741bfcb	29-Jul-2015	Patrick Kelsey <pkelsey@FreeBSD.org>	Revert r265338, r271089 and r271123 as those changes do not handle non-inline urgent data and introduce an mbuf exhaustion attack vector similar to FreeBSD-SA-15:15.tcp, but not requiring VNETs. Address the issue described in FreeBSD-SA-15:15.tcp. Reviewed by: glebius Approved by: so Approved by: jmallett (mentor) Security: FreeBSD-SA-15:15.tcp Sponsored by: Norse Corp, Inc.
# 5571f9cf	16-Apr-2015	Julien Charbon <jch@FreeBSD.org>	Fix an old and well-documented use-after-free race condition in TCP timers: - Add a reference from tcpcb to its inpcb - Defer tcpcb deletion until TCP timers have finished Differential Revision: https://reviews.freebsd.org/D2079 Submitted by: jch, Marc De La Gueronniere <mdelagueronniere@verisign.com> Reviewed by: imp, rrs, adrian, jhb, bz Approved by: jhb Sponsored by: Verisign, Inc.
# 71da7153	17-Nov-2014	Julien Charbon <jch@FreeBSD.org>	Re-introduce padding fields removed with r264321 to keep struct tcptw ABI unchanged. Suggested by: jhb Approved by: jhb (mentor) MFC after: 1 day X-MFC-With: r264321
# 4952ad42	03-Nov-2014	Hans Petter Selasky <hselasky@FreeBSD.org>	Restore spares used in "struct tcpcb" and bump "__FreeBSD_version" to indicate need for kernel module re-compilation. Sponsored by: Mellanox Technologies
# cea40c48	30-Oct-2014	Julien Charbon <jch@FreeBSD.org>	Fix a race condition in TCP timewait between tcp_tw_2msl_reuse() and tcp_tw_2msl_scan(). This race condition drives unplanned timewait timeout cancellation. Also simplify implementation by holding inpcb reference and removing tcptw reference counting. Differential Revision: https://reviews.freebsd.org/D826 Submitted by: Marc De la Gueronniere <mdelagueronniere@verisign.com> Submitted by: jch Reviewed By: jhb (mentor), adrian, rwatson Sponsored by: Verisign, Inc. MFC after: 2 weeks X-MFC-With: r264321
# f6f6703f	07-Oct-2014	Sean Bruno <sbruno@FreeBSD.org>	Implement PLPMTUD blackhole detection (RFC 4821), inspired by code from xnu sources. If we encounter a network where ICMP is blocked the Needs Frag indicator may not propagate back to us. Attempt to downshift the mss once to a preconfigured value. Default this feature to off for now while we do not have a full PLPMTUD implementation in our stack. Adds the following new sysctl's for control: net.inet.tcp.pmtud_blackhole_detection -- turns on/off this feature net.inet.tcp.pmtud_blackhole_mss -- mss to try for ipv4 net.inet.tcp.v6pmtud_blackhole_mss -- mss to try for ipv6 Adds the following new sysctl's for monitoring: -- Number of times the code was activated to attempt a mss downshift net.inet.tcp.pmtud_blackhole_activated -- Number of times the blackhole mss was used in an attempt to downshift net.inet.tcp.pmtud_blackhole_min_activated -- Number of times that we failed to connect after we downshifted the mss net.inet.tcp.pmtud_blackhole_failed Phabricator: https://reviews.freebsd.org/D506 Reviewed by: rpaulo bz MFC after: 2 weeks Relnotes: yes Sponsored by: Limelight Networks
# 29c47f18	27-Sep-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	* Split tcp_signature_compute() into 2 pieces: - tcp_get_sav() - SADB key lookup - tcp_signature_do_compute() - actual computation * Fix TCP signature case for listening socket: do not assume EVERY connection coming to socket with TCP_SIGNATURE set to be md5 signed regardless of SADB key existance for particular address. This fixes the case for routing software having _some_ BGP sessions secured by md5. * Simplify TCP_SIGNATURE handling in tcp_input() MFC after: 2 weeks
# 9fd573c3	22-Sep-2014	Hans Petter Selasky <hselasky@FreeBSD.org>	Improve transmit sending offload, TSO, algorithm in general. The current TSO limitation feature only takes the total number of bytes in an mbuf chain into account and does not limit by the number of mbufs in a chain. Some kinds of hardware is limited by two factors. One is the fragment length and the second is the fragment count. Both of these limits need to be taken into account when doing TSO. Else some kinds of hardware might have to drop completely valid mbuf chains because they cannot loaded into the given hardware's DMA engine. The new way of doing TSO limitation has been made backwards compatible as input from other FreeBSD developers and will use defaults for values not set. Reviewed by: adrian, rmacklem Sponsored by: Mellanox Technologies MFC after: 1 week
# 8f5a8818	07-Aug-2014	Kevin Lo <kevlo@FreeBSD.org>	Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have only one protocol switch structure that is shared between ipv4 and ipv6. Phabric: D476 Reviewed by: jhb
# 4fd2b4eb	24-May-2014	Bjoern A. Zeeb <bz@FreeBSD.org>	Make tcp_twrespond() file local private; this removes it from the public KPI; it is not used anywhere else and seems it never was. MFC after: 2 weeks
# 255cd9fd	23-May-2014	Bjoern A. Zeeb <bz@FreeBSD.org>	Move the tcp_fields_to_host() and tcp_fields_to_net() (inline) functions to the tcp_var.h header file in order to avoid further duplication with upcoming commits. Reviewed by: np MFC after: 2 weeks
# b1a41566	16-May-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Provide compatibility #define after r265408. Suggested by: truckman
# f1395664	15-May-2014	Mike Silbersack <silby@FreeBSD.org>	Remove the function tcp_twrecycleable; it has been #if 0'd for eight years. The original concept was to improve the corner case where you run out of ephemeral ports, but it was causing performance problems and the mechanism of limiting the number of time_wait sockets serves the same purpose in the end. Reviewed by: bz
# c669105d	05-May-2014	Gleb Smirnoff <glebius@FreeBSD.org>	- Remove net.inet.tcp.reass.overflows sysctl. It counts exactly same events that tcpstat's tcps_rcvmemdrop counter counts. - Rename tcps_rcvmemdrop to tcps_rcvreassfull and improve its description in netstat(1) output. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# e407b67b	04-May-2014	Gleb Smirnoff <glebius@FreeBSD.org>	The FreeBSD-SA-14:08.tcp was a lesson on not doing acrobatics with mixing on stack memory and UMA memory in one linked list. Thus, rewrite TCP reassembly code in terms of memory usage. The algorithm remains unchanged. We actually do not need extra memory to build a reassembly queue. Arriving mbufs are always packet header mbufs. So we got the length of data as pkthdr.len. We got m_nextpkt for linkage. And we need only one pointer to point at the tcphdr, use PH_loc for that. In tcpcb the t_segq fields becomes mbuf pointer. The t_segqlen field now counts not packets, but bytes in the queue. This gives us more precision when comparing to socket buffer limits. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 66eefb1e	10-Apr-2014	John Baldwin <jhb@FreeBSD.org>	Currently, the TCP slow timer can starve TCP input processing while it walks the list of connections in TIME_WAIT closing expired connections due to contention on the global TCP pcbinfo lock. To remediate, introduce a new global lock to protect the list of connections in TIME_WAIT. Only acquire the TCP pcbinfo lock when closing an expired connection. This limits the window of time when TCP input processing is stopped to the amount of time needed to close a single connection. Submitted by: Julien Charbon <jcharbon@verisign.com> Reviewed by: rwatson, rrs, adrian MFC after: 2 months
# 5b26ea5d	25-Feb-2014	John Baldwin <jhb@FreeBSD.org>	Remove more constants related to static sysctl nodes. The MAXID constants were primarily used to size the sysctl name list macros that were removed in r254295. A few other constants either did not have an associated sysctl node, or the associated node used OID_AUTO instead. PR: ports/184525 (exp-run)
# a5f44cd7	21-Sep-2013	Bjoern A. Zeeb <bz@FreeBSD.org>	Introduce spares in the TCP syncache and timewait structures so that fixed TCP_SIGNATURE handling can later be merged. This is derived from follow-up work to SVN r183001 posted to net@ on Sep 13 2008. Approved by: re (gjb)
# fd77bbb9	26-Aug-2013	John Baldwin <jhb@FreeBSD.org>	Remove most of the remaining sysctl name list macros. They were only ever intended for use in sysctl(8) and it has not used them for many years. Reviewed by: bde Tested by: exp-run by bdrewery
# 57f60867	25-Aug-2013	Mark Johnston <markj@FreeBSD.org>	Implement the ip, tcp, and udp DTrace providers. The probe definitions use dynamic translation so that their arguments match the definitions for these providers in Solaris and illumos. Thus, existing scripts for these providers should work unmodified on FreeBSD. Tested by: gnn, hiren MFC after: 1 month
# 5da0521f	09-Jul-2013	Andrey V. Elsukov <ae@FreeBSD.org>	Use new macros to implement ipstat and tcpstat using PCPU counters. Change interface of kread_counters() similar ot kread() in the netstat(1).
# 3c914c54	02-Jun-2013	Andre Oppermann <andre@FreeBSD.org>	Allow drivers to specify a maximum TSO length in bytes if they are limited in the amount of data they can handle at once. Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to change the limit. The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything less wouldn't be very useful anymore. The upper limit is still at IP_MAXPACKET (65536 bytes). Raising it requires further auditing of the IPv4/v6 code path's as the length field in the IP header would overflow leading to confusion in firewalls and others packet handler on the real size of the packet. The placement into "struct ifnet" is a bit hackish but the best place that was found. When the stack/driver boundary is updated it should be handled in a better way. Submitted by: cperciva (earlier version) Reviewed by: cperciva Tested by: cperciva MFC after: 1 week (using spare struct members to preserve ABI)
# 5923c293	08-Apr-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Merge from projects/counters: TCP/IP stats. Convert 'struct ipstat' and 'struct tcpstat' to counter(9). This speeds up IP forwarding at extreme packet rates, and makes accounting more precise. Sponsored by: Nginx, Inc.
# 09440655	28-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Increase the initial CWND to 10 segments as defined in IETF TCPM draft-ietf-tcpm-initcwnd-05. It explains why the increased initial window improves the overall performance of many web services without risking congestion collapse. As long as it remains a draft it is placed under a sysctl marking it as experimental: net.inet.tcp.experimental.initcwnd10 = 1 When it becomes an official RFC soon the sysctl will be changed to the RFC number and moved to net.inet.tcp. This implementation differs from the RFC draft in that it is a bit more conservative in the case of packet loss on SYN or SYN\|ACK because we haven't reduced the default RTO to 1 second yet. Also the restart window isn't yet increased as allowed. Both will be adjusted with upcoming changes. Is is enabled by default. In Linux it is enabled since kernel 3.0. MFC after: 2 weeks
# 09fe6320	19-Jun-2012	Navdeep Parhar <np@FreeBSD.org>	- Updated TOE support in the kernel. - Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs. These are available as t3_tom and t4_tom modules that augment cxgb(4) and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as usual with or without these extra features. - iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the works and will follow soon. Build-tested with make universe. 30s overview ============ What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the capabilities of an interface: # ifconfig -m \| grep TOE Enable/disable TCP offload on an interface (just like any other ifnet capability): # ifconfig cxgbe0 toe # ifconfig cxgbe0 -toe Which connections are offloaded? Look for toe4 and/or toe6 in the output of netstat and sockstat: # netstat -np tcp \| grep toe # sockstat -46c \| grep toe Reviewed by: bz, gnn Sponsored by: Chelsio communications. MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)
# ef341ee1	16-Apr-2012	Gleb Smirnoff <glebius@FreeBSD.org>	When we receive an ICMP unreach need fragmentation datagram, we take proposed MTU value from it and update the TCP host cache. Then tcp_mss_update() is called on the corresponding tcpcb. It finds the just allocated entry in the TCP host cache and updates MSS on the tcpcb. And then we do a fast retransmit of what we have in the tcp send buffer. This sequence gets broken if the TCP host cache is exausted. In this case allocation fails, and later called tcp_mss_update() finds nothing in cache. The fast retransmit is done with not reduced MSS and is immidiately replied by remote host with new ICMP datagrams and the cycle repeats. This ping-pong can go up to wirespeed. To fix this: - tcp_mss_update() gets new parameter - mtuoffer, that is like offer, but needs to have min_protoh subtracted. - tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify(). - tcp_mtudisc() now accepts not a useless error argument, but proposed MTU value, that is passed to tcp_mss_update() as mtuoffer. Reported by: az Reported by: Andrey Zonov <andrey zonov.org> Reviewed by: andre (previous version of patch)
# 9077f387	05-Feb-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT, that allow to control initial timeout, idle time, idle re-send interval and idle send count on a per-socket basis. Reviewed by: andre, bz, lstewart
# 9ec4a4cc	16-Oct-2011	Andre Oppermann <andre@FreeBSD.org>	Remove the ss_fltsz and ss_fltsz_local sysctl's which have long been superseded by the RFC3390 initial CWND sizing. Also remove the remnants of TCP_METRICS_CWND which used the TCP hostcache to set the initial CWND in a non-RFC compliant way. MFC after: 1 week
# e233e2ac	16-Oct-2011	Andre Oppermann <andre@FreeBSD.org>	VNET virtualize tcp_sendspace/tcp_recvspace and change the type to INT. A long is not necessary as the TCP window is limited to 2**30. A larger initial window isn't useful. MFC after: 1 week
# d9a36286	17-Jul-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Add spares to the network stack for FreeBSD-9: - TCP keep* timers - TCP UTO (adjust from what was there already) - netmap - route caching - user cookie (temporary to allow for the real fix) Slightly re-shuffle struct ifnet moving fields out of the middle of spares and to better align. Discussed with: rwatson (slightly earlier version)
# 672dc4ae	29-Apr-2011	John Baldwin <jhb@FreeBSD.org>	TCP reuses t_rxtshift to determine the backoff timer used for both the persist state and the retransmit timer. However, the code that implements "bad retransmit recovery" only checks t_rxtshift to see if an ACK has been received in during the first retransmit timeout window. As a result, if ticks has wrapped over to a negative value and a socket is in the persist state, it can incorrectly treat an ACK from the remote peer as a "bad retransmit recovery" and restore saved values such as snd_ssthresh and snd_cwnd. However, if the socket has never had a retransmit timeout, then these saved values will be zero, so snd_ssthresh and snd_cwnd will be set to 0. If the socket is in fast recovery (this can be caused by excessive duplicate ACKs such as those fixed by 220794), then each ACK that arrives triggers either NewReno or SACK partial ACK handling which clamps snd_cwnd to be no larger than snd_ssthresh. In effect, the socket's send window is permamently stuck at 0 even though the remote peer is advertising a much larger window and pending data is only sent via TCP window probes (so one byte every few seconds). Fix this by adding a new TCP pcb flag (TF_PREVVALID) that indicates that the various snd_*_prev fields in the pcb are valid and only perform "bad retransmit recovery" if this flag is set in the pcb. The flag is set on the first retransmit timeout that occurs and is cleared on subsequent retransmit timeouts or when entering the persist state. Reviewed by: bz MFC after: 2 weeks
# 2903309a	25-Apr-2011	Attilio Rao <attilio@FreeBSD.org>	Add the possibility to verify MD5 hash of incoming TCP packets. As long as this is a costy function, even when compiled in (along with the option TCP_SIGNATURE), it can be disabled via the net.inet.tcp.signature_verify_input sysctl. Sponsored by: Sandvine Incorporated Reviewed by: emaste, bz MFC after: 2 weeks
# f1f5cc47	10-Jan-2011	Lawrence Stewart <lstewart@FreeBSD.org>	Fixe some whitespace nits that were introduced in r216758. Sponsored by: FreeBSD Foundation Submitted by: pjd MFC after: 10 weeks X-MFC with: r216758
# 79e955ed	07-Jan-2011	John Baldwin <jhb@FreeBSD.org>	Trim extra spaces before tabs.
# e29f3cc7	27-Dec-2010	Lawrence Stewart <lstewart@FreeBSD.org>	Add a comment for the ccv member of struct tcpcb. Sponsored by: FreeBSD Foundation MFC after: 5 weeks X-MFC with: r215166
# 39bc9de5	27-Dec-2010	Lawrence Stewart <lstewart@FreeBSD.org>	- Add some helper hook points to the TCP stack. The hooks allow Khelp modules to access inbound/outbound events and associated data for established TCP connections. The hooks only run if at least one hook function is registered for the hook point, ensuring the impact on the stack is effectively nil when no TCP Khelp modules are loaded. struct tcp_hhook_data is passed as contextual data to any registered Khelp module hook functions. - Add an OSD (Object Specific Data) pointer to struct tcpcb to allow Khelp modules to associate per-connection data with the TCP control block. - Bump __FreeBSD_version and add a note to UPDATING regarding to ABI changes introduced by this commit and r216753. In collaboration with: David Hayes <dahayes at swin edu au> and Grenville Armitage <garmitage at swin edu au> Sponsored by: FreeBSD Foundation Reviewed by: bz, others along the way MFC after: 3 months
# bee9ab2b	27-Dec-2010	Lawrence Stewart <lstewart@FreeBSD.org>	Add a new sack hint to track the most recent and highest sacked sequence number. This will be used by the incoming Enhanced RTT Khelp module. Sponsored by: FreeBSD Foundation Submitted by: David Hayes <dahayes at swin edu au> Reviewed by: bz and others (as part of a larger patch) MFC after: 3 months
# f5d34df5	17-Nov-2010	George V. Neville-Neil <gnn@FreeBSD.org>	Add new, per connection, statistics for TCP, including: Retransmitted Packets Zero Window Advertisements Out of Order Receives These statistics are available via the -T argument to netstat(1). MFC after: 2 weeks
# 99065ae6	16-Nov-2010	Lawrence Stewart <lstewart@FreeBSD.org>	Move protocol specific implementation detail out of the core CC framework. Sponsored by: FreeBSD Foundation Tested by: Mikolaj Golub <to.my.trociny at gmail com> MFC after: 11 weeks X-MFC with: r215166
# dbc42409	11-Nov-2010	Lawrence Stewart <lstewart@FreeBSD.org>	This commit marks the first formal contribution of the "Five New TCP Congestion Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details about the project are available at: http://caia.swin.edu.au/freebsd/5cc/ - Add a KPI and supporting infrastructure to allow modular congestion control algorithms to be used in the net stack. Algorithms can maintain per-connection state if required, and connections maintain their own algorithm pointer, which allows different connections to concurrently use different algorithms. The TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to programmatically query or change the congestion control algorithm respectively from within an application at runtime. - Integrate the framework with the TCP stack in as least intrusive a manner as possible. Care was also taken to develop the framework in a way that should allow integration with other congestion aware transport protocols (e.g. SCTP) in the future. The hope is that we will one day be able to share a single set of congestion control algorithm modules between all congestion aware transport protocols. - Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack and use it to decouple the meaning of recovery from a congestion event and recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based congestion control protocols don't generally need to recover from packet loss and need a different way to note a congestion recovery episode within the stack. - Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code and ensures the stack always uses the appropriate mechanisms for recovering from packet loss during a congestion recovery episode. - Extract the NewReno congestion control algorithm from the TCP stack and massage it into module form. NewReno is always built into the kernel and will remain the default algorithm for the forseeable future. Implementations of additional different algorithms will become available in the near future. - Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code that relies on the size of "struct tcpcb" is required. Many thanks go to the Cisco University Research Program Fund at Community Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work at the Centre for Advanced Internet Architectures, Swinburne University of Technology is greatly appreciated. In collaboration with: David Hayes <dahayes at swin edu au> and Grenville Armitage <garmitage at swin edu au> Sponsored by: Cisco URP, FreeBSD Foundation Reviewed by: rpaulo Tested by: David Hayes (and many others over the years) MFC after: 3 months
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# 0c236c4e	24-Sep-2010	Lawrence Stewart <lstewart@FreeBSD.org>	Internalise reassembly queue related functionality and variables which should not be used outside of the reassembly queue implementation. Provide a new function to flush all segments from a reassembly queue and call it from the appropriate places instead of manipulating the queue directly. Sponsored by: FreeBSD Foundation Reviewed by: andre, gnn, rpaulo MFC after: 2 weeks
# 1c18314d	16-Sep-2010	Andre Oppermann <andre@FreeBSD.org>	Remove the TCP inflight bandwidth limiter as announced in r211315 to give way for the pluggable congestion control framework. It is the task of the congestion control algorithm to set the congestion window and amount of inflight data without external interference. In 'struct tcpcb' the variables previously used by the inflight limiter are renamed to spares to keep the ABI intact and to have some more space for future extensions. In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to preserve the ABI. It is always set to 0. In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed to preserve the ABI. It is always set to 0. These unused variable in the various structures may be reused in the future or garbage collected before the next release or at some other point when an ABI change happens anyway for other reasons. No MFC is planned. The inflight bandwidth limiter stays disabled by default in the other branches but remains available.
# c3f0bdc6	18-Aug-2010	Andre Oppermann <andre@FreeBSD.org>	If a TCP connection has been idle for one retransmit timeout or more it must reset its congestion window back to the initial window. RFC3390 has increased the initial window from 1 segment to up to 4 segments. The initial window increase of RFC3390 wasn't reflected into the restart window which remained at its original defaults of 4 segments for local and 1 segment for all other connections. Both values are controllable through sysctl net.inet.tcp.local_slowstart_flightsize and net.inet.tcp.slowstart_flightsize. The increase helps TCP's slow start algorithm to open up the congestion window much faster. Reviewed by: lstewart MFC after: 1 week
# b7d747ec	18-Aug-2010	Andre Oppermann <andre@FreeBSD.org>	Untangle the net.inet.tcp.log_in_vain and net.inet.tcp.log_debug sysctl's and remove any side effects. Both sysctl's share the same backend infrastructure and due to the way it was implemented enabling net.inet.tcp.log_in_vain would also cause log_debug output to be generated. This was surprising and eventually annoying to the user. The log output backend is kept the same but a little shim is inserted to properly separate log_in_vain and log_debug and to remove any side effects. PR: kern/137317 MFC after: 1 week
# 480d7c6c	06-May-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	MFC r207369: MFP4: @176978-176982, 176984, 176990-176994, 177441 "Whitspace" churn after the VIMAGE/VNET whirls. Remove the need for some "init" functions within the network stack, like pim6_init(), icmp_init() or significantly shorten others like ip6_init() and nd6_init(), using static initialization again where possible and formerly missed. Move (most) variables back to the place they used to be before the container structs and VIMAGE_GLOABLS (before r185088) and try to reduce the diff to stable/7 and earlier as good as possible, to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9. This also removes some header file pollution for putatively static global variables. Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are no longer needed. Reviewed by: jhb Discussed with: rwatson Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH
# 82cea7e6	29-Apr-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	MFP4: @176978-176982, 176984, 176990-176994, 177441 "Whitspace" churn after the VIMAGE/VNET whirls. Remove the need for some "init" functions within the network stack, like pim6_init(), icmp_init() or significantly shorten others like ip6_init() and nd6_init(), using static initialization again where possible and formerly missed. Move (most) variables back to the place they used to be before the container structs and VIMAGE_GLOABLS (before r185088) and try to reduce the diff to stable/7 and earlier as good as possible, to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9. This also removes some header file pollution for putatively static global variables. Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are no longer needed. Reviewed by: jhb Discussed with: rwatson Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH MFC after: 6 days
# 62f500d0	27-Mar-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	MFC r204838: Destroy TCP UMA zones (empty or not) upon network stack teardown to not leak them, otherwise making UMA/vmstat unhappy with every stoped vnet. We will still leak pages (especially for zones marked NOFREE). Reshuffle cleanup order in tcp_destroy() to get rid of what we can easily free first. Reviewed by: rwatson
# 376aadf8	07-Mar-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	Destroy TCP UMA zones (empty or not) upon network stack teardown to not leak them, otherwise making UMA/vmstat unhappy with every stoped vnet. We will still leak pages (especially for zones marked NOFREE). Reshuffle cleanup order in tcp_destroy() to get rid of what we can easily free first. Sponsored by: ISPsystem Reviewed by: rwatson MFC after: 5 days
# be6797dd	23-Jan-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	MFC r202469: Garbage collect references to the no longer implemented tcp_fasttimo().
# 4dcc55a3	17-Jan-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	Garbage collect references to the no longer implemented tcp_fasttimo(). Discussed with: rwatson MFC after: 5 days
# b8614722	15-Sep-2009	Mike Silbersack <silby@FreeBSD.org>	Add the ability to see TCP timers via netstat -x. This can be a useful feature when you have a seemingly stuck socket and want to figure out why it has not been closed yet. No plans to MFC this, as it changes the netstat sysctl ABI. Reviewed by: andre, rwatson, Eric Van Gyzen
# 315e3e38	02-Aug-2009	Robert Watson <rwatson@FreeBSD.org>	Many network stack subsystems use a single global data structure to hold all pertinent statatistics for the subsystem. These structures are sometimes "borrowed" by kernel modules that require a place to store statistics for similar events. Add KPI accessor functions for statistics structures referenced by kernel modules so that they no longer encode certain specifics of how the data structures are named and stored. This change is intended to make it easier to move to per-CPU network stats following 8.0-RELEASE. The following modules are affected by this change: if_bridge if_cxgb if_gif ip_mroute ipdivert pf In practice, most of these statistics consumers should, in fact, maintain their own statistics data structures rather than borrowing structures from the base network stack. However, that change is too agressive for this point in the release cycle. Reviewed by: bz Approved by: re (kib)
# 1e77c105	16-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Remove unused VNET_SET() and related macros; only VNET_GET() is ever actually used. Rename VNET_GET() to VNET() to shorten variable references. Discussed with: bz, julian Reviewed by: bz Approved by: re (kensmith, kib)
# eddfbb76	14-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Build on Jeff Roberson's linker-set based dynamic per-CPU allocator (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables. Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing the in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of a the kernel linker. Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided. This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS. Bump __FreeBSD_version and update UPDATING. Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)
# 237fbe0a	13-Jul-2009	Lawrence Stewart <lstewart@FreeBSD.org>	Replace struct tcpopt with a proxy toeopt struct in the TOE driver interface to the TCP syncache. This returns struct tcpopt to being private within the TCP implementation, thus allowing it to be modified without ABI concerns. The patch breaks the ABI. Bump __FreeBSD_version to 800103 accordingly. The cxgb driver is the only TOE consumer affected by this change, and needs to be recompiled along with the kernel. Suggested by: rwatson Reviewed by: rwatson, kmacy Approved by: re (kensmith), kensmith (mentor temporarily unavailable)
# 962ebef8	12-Jul-2009	Lawrence Stewart <lstewart@FreeBSD.org>	Pad the following TCP related structs to allow MFCs of upcoming features/fixes back to the 8 branch: tcp_var.h - struct sackhint - struct tcpcb - struct tcpstat The patch breaks the ABI. Bump __FreeBSD_version to 800102 accordingly. User space tools that rely on the size of any of these structs (e.g. sockstat) need to be recompiled. Reviewed by: rpaulo, sam, andre, rwatson Approved by: re & mentor (gnn)
# 9f78a87a	16-Jun-2009	John Baldwin <jhb@FreeBSD.org>	- Change members of tcpcb that cache values of ticks from int to u_int: t_rcvtime, t_starttime, t_rtttime, t_bw_rtttime, ts_recent_age, t_badrxtwin. - Change t_recent in struct timewait from u_long to u_int32_t to match the type of the field it shadows from tcpcb: ts_recent. - Change t_starttime in struct timewait from u_long to u_int to match the t_starttime field in tcpcb. Requested by: bde (1, 3)
# 0e8cc7e7	10-Jun-2009	John Baldwin <jhb@FreeBSD.org>	Change a few members of tcpcb that store cached copies of ticks to be ints instead of unsigned longs. This fixes a few overflow edge cases on 64-bit platforms. Specifically, if an idle connection receives a packet shortly before 2^31 clock ticks of uptime (about 25 days with hz=1000) and the keep alive timer fires after 2^31 clock ticks, the keep alive timer will think that the connection has been idle for a very long time and will immediately drop the connection instead of sending a keep alive probe. Reviewed by: silby, gnn, lstewart MFC after: 1 week
# bc29160d	08-Jun-2009	Marko Zec <zec@FreeBSD.org>	Introduce an infrastructure for dismantling vnet instances. Vnet modules and protocol domains may now register destructor functions to clean up and release per-module state. The destructor mechanisms can be triggered by invoking "vimage -d", or a future equivalent command which will be provided via the new jail framework. While this patch introduces numerous placeholder destructor functions, many of those are currently incomplete, thus leaking memory or (even worse) failing to stop all running timers. Many of such issues are already known and will be incrementaly fixed over the next weeks in smaller incremental commits. Apart from introducing new fields in structs ifnet, domain, protosw and vnet_net, which requires the kernel and modules to be rebuilt, this change should have no impact on nooptions VIMAGE builds, since vnet destructors can only be called in VIMAGE kernels. Moreover, destructor functions should be in general compiled in only in options VIMAGE builds, except for kernel modules which can be safely kldunloaded at run time. Bump __FreeBSD_version to 800097. Reviewed by: bz, julian Approved by: rwatson, kib (re), julian (mentor)
# f6dfe47a	30-Apr-2009	Marko Zec <zec@FreeBSD.org>	Permit buiding kernels with options VIMAGE, restricted to only a single active network stack instance. Turning on options VIMAGE at compile time yields the following changes relative to default kernel build: 1) V_ accessor macros for virtualized variables resolve to structure fields via base pointers, instead of being resolved as fields in global structs or plain global variables. As an example, V_ifnet becomes: options VIMAGE: ((struct vnet_net ) vnet_net)->_ifnet default build: vnet_net_0._ifnet options VIMAGE_GLOBALS: ifnet 2) INIT_VNET_ macros will declare and set up base pointers to be used by V_ accessor macros, instead of resolving to whitespace: INIT_VNET_NET(ifp->if_vnet); becomes struct vnet_net vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET]; 3) Memory for vnet modules registered via vnet_mod_register() is now allocated at run time in sys/kern/kern_vimage.c, instead of per vnet module structs being declared as globals. If required, vnet modules can now request the framework to provide them with allocated bzeroed memory by filling in the vmi_size field in their vmi_modinfo structures. 4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are extended to hold a pointer to the parent vnet. options VIMAGE builds will fill in those fields as required. 5) curvnet is introduced as a new global variable in options VIMAGE builds, always pointing to the default and only struct vnet. 6) struct sysctl_oid has been extended with additional two fields to store major and minor virtualization module identifiers, oid_v_subs and oid_v_mod. SYSCTL_V_ family of macros will fill in those fields accordingly, and store the offset in the appropriate vnet container struct in oid_arg1. In sysctl handlers dealing with virtualized sysctls, the SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target variable and make it available in arg1 variable for further processing. Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have been deleted. Reviewed by: bz, rwatson Approved by: julian (mentor)
# de231a06	12-Apr-2009	Robert Watson <rwatson@FreeBSD.org>	Put TCPSTAT_ADD() and TCPSTAT_INC() behind _KERNEL. MFC after: 3 days
# 78b50714	11-Apr-2009	Robert Watson <rwatson@FreeBSD.org>	Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and TCPSTAT_INC(), rather than directly manipulating the fields across the kernel. This will make it easier to change the implementation of these statistics, such as using per-CPU versions of the data structures. MFC after: 3 days
# de4fbddd	22-Jan-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	Add externs to fix build with VIMAGE_GLOBALS after r187289.
# 24cb0f22	14-Jan-2009	Lawrence Stewart <lstewart@FreeBSD.org>	Add TCP Appropriate Byte Counting (RFC 3465) support to kernel. The new behaviour is on by default, and can be disabled by setting the net.inet.tcp.rfc3465 sysctl to 0 to obtain previous behaviour. The patch changes struct tcpcb in sys/netinet/tcp_var.h which breaks the ABI. Bump __FreeBSD_version to 800061 accordingly. User space tools that rely on the size of struct tcpcb (e.g. sockstat) need to be recompiled. Reviewed by: rpaulo, gnn Approved by: gnn, kmacy (mentors) Sponsored by: FreeBSD Foundation
# 1b193af6	13-Dec-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Second round of putting global variables, which were virtualized but formerly missed under VIMAGE_GLOBAL. Put the extern declarations of the virtualized globals under VIMAGE_GLOBAL as the globals themsevles are already. This will help by the time when we are going to remove the globals entirely. Sponsored by: The FreeBSD Foundation
# 86413abf	11-Dec-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Put a global variables, which were virtualized but formerly missed under VIMAGE_GLOBAL. Start putting the extern declarations of the virtualized globals under VIMAGE_GLOBAL as the globals themsevles are already. This will help by the time when we are going to remove the globals entirely. While there garbage collect a few dead externs from ip6_var.h. Sponsored by: The FreeBSD Foundation
# c3ce7a79	10-Dec-2008	Robert Watson <rwatson@FreeBSD.org>	Move flag definitions for t_flags and t_oobflags below the definition of struct tcpcb so that the structure definition is a bit more vertically compact. Can't yet fit it on one printed page, though. MFC after: pretty soon
# 44e33a07	19-Nov-2008	Marko Zec <zec@FreeBSD.org>	Change the initialization methodology for global variables scheduled for virtualization. Instead of initializing the affected global variables at instatiation, assign initial values to them in initializer functions. As a rule, initialization at instatiation for such variables should never be introduced again from now on. Furthermore, enclose all instantiations of such global variables in #ifdef VIMAGE_GLOBALS blocks. Essentialy, this change should have zero functional impact. In the next phase of merging network stack virtualization infrastructure from p4/vimage branch, the new initialization methology will allow us to switch between using global variables and their counterparts residing in virtualization containers with minimum code churn, and in the long run allow us to intialize multiple instances of such container structures. Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 91d6cfa6	06-Nov-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix a bug introduced with r182851 splitting tcp_mss() into tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could re-use the same code. Move the TSO logic back to tcp_mss() and out of tcp_mss_update(). We tried to avoid that initially but if were are called from tcp_output() with EMSGSIZE, we cleared the TSO flag on the tcpcb there, called into tcp_mtudisc() and tcp_mss_update() which then would reenable TSO on the tcpcb based on TSO capabilities of the interface as learnt in tcp_maxmtu/6(). So if TSO was enabled on the (possibly new) outgoing interface it was turned back on, which lead to an endless loop between tcp_output() and tcp_mtudisc() until we overflew the stack. Reported by: kmacy MFC after: 2 months (along with r182851)
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 3cee92e0	07-Sep-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Split tcp_mss() in tcp_mss() and tcp_mss_update() where the former calls the latter. Merge tcp_mss_update() with code from tcp_mtudisc() basically doing the same thing. This gives us one central place where we calcuate and check mss values to update t_maxopd (maximum mss + options length) instead of two slightly different but almost equal implementations to maintain. PR: kern/118455 Reviewed by: silby (back in March) MFC after: 2 months
# f2512ba1	31-Jul-2008	Rui Paulo <rpaulo@FreeBSD.org>	MFp4 (//depot/projects/tcpecn/): TCP ECN support. Merge of my GSoC 2006 work for NetBSD. TCP ECN is defined in RFC 3168. Partly reviewed by: dwmalone, silby Obtained from: NetBSD
# 032fae41	20-Apr-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Revert to rev. 1.161 - switch back to optimized TCP options ordering. A lot of testing has shown that the problem people were seeing was due to invalid padding after the end of option list option, which was corrected in tcp_output.c rev. 1.146. Thanks to: anders@, s3raphi, Matt Reimer Thanks to: Doug Hardie and Randy Rose, John Mayer, Susan Guzzardi Special thanks to: dwhite@ and BitGravity Discussed with: silby MFC after: 1 day
# ea346b19	23-Feb-2008	Mike Silbersack <silby@FreeBSD.org>	Change FreeBSD 7 so that it returns TCP options in the same order that FreeBSD 6 and before did. Doug White and the other bloodhounds at ISC discovered that while FreeBSD 7's ordering of options was more efficient, it caused some cable modem routers to ignore the SYN-ACKs ordered in this fashion. The placement of sackOK after the timestamp option seems to be the critical difference: FreeBSD 6: <mss 1460,nop,wscale 1,nop,nop,timestamp 3512155768 0,sackOK,eol> FreeBSD 7.0: <mss 1460,nop,wscale 3,sackOK,timestamp 1370692577 0> FreeBSD 7.0 + this change: <mss 1460,nop,wscale 3,nop,nop,timestamp 7371813 0,sackOK,eol> MFC after: 1 week
# 76b262c4	12-Dec-2007	Kip Macy <kmacy@FreeBSD.org>	Fix style issues with initial TCP offload commit Requested by: rwatson Submitted by: rwatson
# 620721db	12-Dec-2007	Kip Macy <kmacy@FreeBSD.org>	Add driver independent interface to offload active established TCP connections Reviewed by: silby
# 2de2af32	06-Dec-2007	Kip Macy <kmacy@FreeBSD.org>	Add padding for anticipated functionality - vimage - TOE - multiq - host rtentry caching Rename spare used by 80211 to if_llsoftc Reviewed by: rwatson, gnn MFC after: 1 day
# e2f2059f	23-Sep-2007	Mike Silbersack <silby@FreeBSD.org>	Two changes: - Reintegrate the ANSI C function declaration change from tcp_timer.c rev 1.92 - Reorganize the tcpcb structure so that it has a single pointer to the "tcp_timer" structure which contains all of the tcp timer callouts. This change means that when the single tcp timer change is reintegrated, tcpcb will not change in size, and therefore the ABI between netstat and the kernel will not change. Neither of these changes should have any functional impact. Reviewed by: bmah, rrs Approved by: re (bmah)
# 85d94372	07-Sep-2007	Robert Watson <rwatson@FreeBSD.org>	Back out tcp_timer.c:1.93 and associated changes that reimplemented the many TCP timers as a single timer, but retain the API changes necessary to reintroduce this change. This will back out the source of at least two reported problems: lock leaks in certain timer edge cases, and TCP timers continuing to fire after a connection has closed (a bug previously fixed and then reintroduced with the timer rewrite). In a follow-up commit, some minor restylings and comment changes performed after the TCP timer rewrite will be reapplied, and a further change to allow the TCP timer rewrite to be added back without disturbing the ABI. The new design is believed to be a good thing, but the outstanding issues are leading to significant stability/correctness problems that are holding up 7.0. This patch was generated by silby, but is being committed by proxy due to poor network connectivity for silby this week. Approved by: re (kensmith) Submitted by: silby Tested by: rwatson, kris Problems reported by: peter, kris, others
# 773673c1	27-Jul-2007	Andre Oppermann <andre@FreeBSD.org>	Provide a sysctl to toggle reporting of TCP debug logging: sys.net.inet.tcp.log_debug = 1 It defaults to enabled for the moment and is to be turned off for the next release like other diagnostics from development branches. It is important to note that sysctl sys.net.inet.tcp.log_in_vain uses the same logging function as log_debug. Enabling of the former also causes the latter to engage, but not vice versa. Use consistent terminology in tcp log messages: "ignored" means a segment contains invalid flags/information and is dropped without changing state or issuing a reply. "rejected" means a segments contains invalid flags/information but is causing a reply (usually RST) and may cause a state change. Approved by: re (rwatson)
# c325962b	26-Jul-2007	Mike Silbersack <silby@FreeBSD.org>	Export the contents of the syncache to netstat. Approved by: re (kensmith) MFC after: 2 weeks
# 9fb5d4c0	04-Jul-2007	Peter Wemm <peter@FreeBSD.org>	Fix cast-qualifiers warning when INET6 is not present Approved by: re (rwatson)
# a160e630	28-May-2007	Andre Oppermann <andre@FreeBSD.org>	Refactor and rewrite in parts the SYN handling code on listen sockets in tcp_input(): o tighten the checks on allowed TCP flags to be RFC793 and tcp-secure conform o log check failures to syslog at LOG_DEBUG level o rearrange the code flow to be easier to follow o add KASSERTs to validate assumptions of the code flow Add sysctl net.inet.tcp.syncache.rst_on_sock_fail defaulting to enable that controls the behavior on socket creation failure for a otherwise successful 3-way handshake. The socket creation can fail due to global memory shortage, listen queue limits and file descriptor limits. The sysctl allows to chose between two options to deal with this. One is to send a reset to the other endpoint to notify it about the failure (default). The other one is to ignore and treat the failure as a transient error and have the other endpoint retransmit for another try. Reviewed by: rwatson (in general)
# df541e5f	18-May-2007	Andre Oppermann <andre@FreeBSD.org>	Add tcp_log_addrs() function to generate and standardized TCP log line for use thoughout the tcp subsystem. It is IPv4 and IPv6 aware creates a line in the following format: "TCP: [1.2.3.4]:50332 to [1.2.3.4]:80 tcpflags <RST>" A "\n" is not included at the end. The caller is supposed to add further information after the standard tcp log header. The function returns a NUL terminated string which the caller has to free(s, M_TCPLOG) after use. All memory allocation is done with M_NOWAIT and the return value may be NULL in memory shortage situations. Either struct in_conninfo \|\| (struct tcphdr && (struct ip \|\| struct ip6_hdr) have to be supplied. Due to ip[6].h header inclusion limitations and ordering issues the struct ip and struct ip6_hdr parameters have to be casted and passed as void * pointers. tcp_log_addrs(struct in_conninfo inc, struct tcphdr th, void ip4hdr, void ip6hdr) Usage example: struct ip ip; char tcplog; if (tcplog = tcp_log_addrs(NULL, th, (void *)ip, NULL)) { log(LOG_DEBUG, "%s; %s: Connection attempt to closed port\n", tcplog, __func__); free(s, M_TCPLOG); }
# 2104448f	16-May-2007	Andre Oppermann <andre@FreeBSD.org>	Move TIME_WAIT related functions and timer handling from files other than repo copied tcp_subr.c into tcp_timewait.c#1.284: tcp_input.c#1.350 tcp_timewait() -> tcp_twcheck() tcp_timer.c#1.92 tcp_timer_2msl_reset() -> tcp_tw_2msl_reset() tcp_timer.c#1.92 tcp_timer_2msl_stop() -> tcp_tw_2msl_stop() tcp_timer.c#1.92 tcp_timer_2msl_tw() -> tcp_tw_2msl_scan() This is a mechanical move with appropriate renames and making them static if used only locally. The tcp_tw_2msl_scan() cleanup function is still run from the tcp_slowtimo() in tcp_timer.c.
# ec9c7553	13-May-2007	Andre Oppermann <andre@FreeBSD.org>	Complete the (mechanical) move of the TCP reassembly and timewait functions from their origininal place to their own files. TCP Reassembly from tcp_input.c -> tcp_reass.c TCP Timewait from tcp_subr.c -> tcp_timewait.c
# 504abdc6	11-May-2007	Andre Oppermann <andre@FreeBSD.org>	Add the timestamp offset to struct tcptw so we can generate proper ACKs in TIME_WAIT state that don't get dropped by the PAWS check on the receiver.
# 1a553740	06-May-2007	Andre Oppermann <andre@FreeBSD.org>	Remove unused requested_s_scale from struct tcpcb.
# 3529149e	06-May-2007	Andre Oppermann <andre@FreeBSD.org>	Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of a decdicated sack_enable int for this bool. Change all users accordingly.
# 6087c3c2	04-May-2007	Robert Watson <rwatson@FreeBSD.org>	Add global mutex tcp_debug_mtx, which will protect global TCP debugging state tcp_debug, tcp_debx. Acquire and drop as required in tcp_trace(). Move to ANSI C function header, correct prototype types so that short TCP state is no longer promoted to int unnecessarily. Add comments. MFC after: 3 weeks
# df47e437	20-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	o Remove unncessary TOF_SIGLEN flag from struct tcpopt o Correctly set to->to_signature in tcp_dooptions() o Update comments
# 4d6e7130	20-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	Remove bogus check for accept queue length and associated failure handling from the incoming SYN handling section of tcp_input(). Enforcement of the accept queue limits is done by sonewconn() after the 3WHS is completed. It is not necessary to have an earlier check before a connection request enters the SYN cache awaiting the full handshake. It rather limits the effectiveness of the syncache by preventing legit and illegit connections from entering it and having them shaken out before we hit the real limit which may have vanished by then. Change return value of syncache_add() to void. No status communication is required.
# b8152ba7	11-Apr-2007	Andre Oppermann <andre@FreeBSD.org>	Change the TCP timer system from using the callout system five times directly to a merged model where only one callout, the next to fire, is registered. Instead of callout_reset(9) and callout_stop(9) the new function tcp_timer_activate() is used which then internally manages the callout. The single new callout is a mutex callout on inpcb simplifying the locking a bit. tcp_timer() is the called function which handles all race conditions in one place and then dispatches the individual timer functions. Reviewed by: rwatson (earlier version)
# e406f5a1	21-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Remove tcp_minmssoverload DoS detection logic. The problem it tried to protect us from wasn't really there and it only bloats the code. Should the problem surface in the future we can simply resurrect it from cvs history.
# 02a1a643	15-Mar-2007	Andre Oppermann <andre@FreeBSD.org>	Consolidate insertion of TCP options into a segment from within tcp_output() and syncache_respond() into its own generic function tcp_addoptions(). tcp_addoptions() is alignment agnostic and does optimal packing in all cases. In struct tcpopt rename to_requested_s_scale to just to_wscale. Add a comment with quote from RFC1323: "The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself is never scaled." Reviewed by: silby, mohans, julian Sponsored by: TCP/IP Optimization Fundraise 2005
# 7c72af87	26-Feb-2007	Mohan Srinivasan <mohans@FreeBSD.org>	Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigate potential issues where the peer does not close, potentially leaving thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl fast_finwait2_recycle, which is disabled by default. Reviewed by: gnn, silby.
# 6741ecf5	01-Feb-2007	Andre Oppermann <andre@FreeBSD.org>	Auto sizing TCP socket buffers. Normally the socket buffers are static (either derived from global defaults or set with setsockopt) and do not adapt to real network conditions. Two things happen: a) your socket buffers are too small and you can't reach the full potential of the network between both hosts; b) your socket buffers are too big and you waste a lot of kernel memory for data just sitting around. With automatic TCP send and receive socket buffers we can start with a small buffer and quickly grow it in parallel with the TCP congestion window to match real network conditions. FreeBSD has a default 32K send socket buffer. This supports a maximal transfer rate of only slightly more than 2Mbit/s on a 100ms RTT trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send buffer auto scaling and the default values below it supports 20Mbit/s at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or 1000%. For the receive side it looks slightly better with a default of 64K buffer size. New sysctls are: net.inet.tcp.sendbuf_auto=1 (enabled) net.inet.tcp.sendbuf_inc=8192 (8K, step size) net.inet.tcp.sendbuf_max=262144 (256K, growth limit) net.inet.tcp.recvbuf_auto=1 (enabled) net.inet.tcp.recvbuf_inc=16384 (16K, step size) net.inet.tcp.recvbuf_max=262144 (256K, growth limit) Tested by: many (on HEAD and RELENG_6) Approved by: re MFC after: 1 month
# bf6d304a	13-Sep-2006	Andre Oppermann <andre@FreeBSD.org>	Rewrite of TCP syncookies to remove locking requirements and to enhance functionality: - Remove a rwlock aquisition/release per generated syncookie. Locking is now integrated with the bucket row locking of syncache itself and syncookies no longer add any additional lock overhead. - Syncookie secrets are different for and stored per syncache buck row. Secrets expire after 16 seconds and are reseeded on-demand. - The computational overhead for syncookie generation and verification is one MD5 hash computation as before. - Syncache can be turned off and run with syncookies only by setting the sysctl net.inet.tcp.syncookies_only=1. This implementation extends the orginal idea and first implementation of FreeBSD by using not only the initial sequence number field to store information but also the timestamp field if present. This way we can keep track of the entire state we need to know to recreate the session in its original form. Almost all TCP speakers implement RFC1323 timestamps these days. For those that do not we still have to live with the known shortcomings of the ISN only SYN cookies. The use of the timestamp field causes the timestamps to be randomized if syncookies are enabled. The idea of SYN cookies is to encode and include all necessary information about the connection setup state within the SYN-ACK we send back and thus to get along without keeping any local state until the ACK to the SYN-ACK arrives (if ever). Everything we need to know should be available from the information we encoded in the SYN-ACK. A detailed description of the inner working of the syncookies mechanism is included in the comments in tcp_syncache.c. Reviewed by: silby (slightly earlier version) Sponsored by: TCP/IP Optimization Fundraise 2005
# 751dea29	07-Sep-2006	Ruslan Ermilov <ru@FreeBSD.org>	Back when we had T/TCP support, we used to apply different timeouts for TCP and T/TCP connections in the TIME_WAIT state, and we had two separate timed wait queues for them. Now that is has gone, the timeout is always 2*MSL again, and there is no reason to keep two queues (the first was unused anyway!). Also, reimplement the remaining queue using a TAILQ (it was technically impossible before, with two queues).
# 233dcce1	06-Sep-2006	Andre Oppermann <andre@FreeBSD.org>	First step of TSO (TCP segmentation offload) support in our network stack. o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6 o add CSUM_TSO flag to mbuf pkthdr csum_flags field o add tso_segsz field to mbuf pkthdr o enhance ip_output() packet length check to allow for large TSO packets o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities o adjust all callers of tcp_maxmtu[46]() accordingly Discussed on: -current, -net Sponsored by: TCP/IP Optimization Fundraise 2005
# 2c857a9b	06-Sep-2006	Gleb Smirnoff <glebius@FreeBSD.org>	o Backout rev. 1.125 of in_pcb.c. It appeared to behave extremely bad under high load. For example with 40k sockets and 25k tcptw entries, connect() syscall can run for seconds. Debugging showed that it iterates the cycle millions times and purges thousands of tcptw entries at a time. Besides practical unusability this change is architecturally wrong. First, in_pcblookup_local() is used in connect() and bind() syscalls. No stale entries purging shouldn't be done here. Second, it is a layering violation. o Return back the tcptw purging cycle to tcp_timer_2msl_tw(), that was removed in rev. 1.78 by rwatson. The commit log of this revision tells nothing about the reason cycle was removed. Now we need this cycle, since major cleaner of stale tcptw structures is removed. o Disable probably necessary, but now unused tcp_twrecycleable() function. Reviewed by: ru
# f72167f4	26-Jun-2006	Andre Oppermann <andre@FreeBSD.org>	Some cleanups and janitorial work to tcp_dooptions(): o redefine the parameter 'is_syn' to 'flags', add TO_SYN flag and adjust its usage accordingly o update the comments to the tcp_dooptions() invocation in tcp_input():after_listen to reflect reality o move the logic checking the echoed timestamp out of tcp_dooptions() to the only place that uses it next to the invocation described in the previous item o adjust parsing of TCPOPT_SACK_PERMITTED to use the same style as the others o add comments in to struct tcpopt.to_flags #defines No functional changes. Sponsored by: TCP/IP Optimization Fundraise 2005
# 8411d000	17-Jun-2006	Andre Oppermann <andre@FreeBSD.org>	Move all syncache related structures to tcp_syncache.c. They are only used there. This unbreaks userland programs that include tcp_var.h. Discussed with: rwatson
# 93f0d0c5	17-Jun-2006	Andre Oppermann <andre@FreeBSD.org>	Rearrange fields in struct syncache and syncache_head to make them more cache line friendly. Sponsored by: TCP/IP Optimization Fundraise 2005
# 351630c4	17-Jun-2006	Andre Oppermann <andre@FreeBSD.org>	Add locking to TCP syncache and drop the global tcpinfo lock as early as possible for the syncache_add() case. The syncache timer no longer aquires the tcpinfo lock and timeout/retransmit runs can happen in parallel with bucket granularity. On a P4 the additional locks cause a slight degression of 0.7% in tcp connections per second. When IP and TCP input are deserialized and can run in parallel this little overhead can be neglected. The syncookie handling still leaves room for improvement and its random salts may be moved to the syncache bucket head structures to remove the second lock operation currently required for it. However this would be a more involved change from the way syncookies work at the moment. Reviewed by: rwatson Tested by: rwatson, ps (earlier version) Sponsored by: TCP/IP Optimization Fundraise 2005
# 623dce13	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Update TCP for infrastructural changes to the socket/pcb refcount model, pru_abort(), pru_detach(), and in_pcbdetach(): - Universally support and enforce the invariant that so_pcb is never NULL, converting dozens of unnecessary NULL checks into assertions, and eliminating dozens of unnecessary error handling cases in protocol code. - In some cases, eliminate unnecessary pcbinfo locking, as it is no longer required to ensure so_pcb != NULL. For example, the receive code no longer requires the pcbinfo lock, and the send code only requires it if building a new connection on an otherwise unconnected socket triggered via sendto() with an address. This should significnatly reduce tcbinfo lock contention in the receive and send cases. - In order to support the invariant that so_pcb != NULL, it is now necessary for the TCP code to not discard the tcpcb any time a connection is dropped, but instead leave the tcpcb until the socket is shutdown. This case is handled by setting INP_DROPPED, to substitute for using a NULL so_pcb to indicate that the connection has been dropped. This requires the inpcb lock, but not the pcbinfo lock. - Unlike all other protocols in the tree, TCP may need to retain access to the socket after the file descriptor has been closed. Set SS_PROTOREF in tcp_detach() in order to prevent the socket from being freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether or not it needs to free the socket when the connection finally does close. The typical case where this occurs is if close() is called on a TCP socket before all sent data in the send socket buffer has been transmitted or acknowledged. If INP_SOCKREF is found when the connection is dropped, we release the inpcb, tcpcb, and socket instead of flagging INP_DROPPED. - Abort and detach protocol switch methods no longer return failures, nor attempt to free sockets, as the socket layer does this. - Annotate the existence of a long-standing race in the TCP timer code, in which timers are stopped but not drained when the socket is freed, as waiting for drain may lead to deadlocks, or have to occur in a context where waiting is not permitted. This race has been handled by testing to see if the tcpcb pointer in the inpcb is NULL (and vice versa), which is not normally permitted, but may be true of a inpcb and tcpcb have been freed. Add a counter to test how often this race has actually occurred, and a large comment for each instance where we compare potentially freed memory with NULL. This will have to be fixed in the near future, but requires is to further address how to handle the timer shutdown shutdown issue. - Several TCP calls no longer potentially free the passed inpcb/tcpcb, so no longer need to return a pointer to indicate whether the argument passed in is still valid. - Un-macroize debugging and locking setup for various protocol switch methods for TCP, as it lead to more obscurity, and as locking becomes more customized to the methods, offers less benefit. - Assert copyright on tcp_usrreq.c due to significant modifications that have been made as part of this work. These changes significantly modify the memory management and connection logic of our TCP implementation, and are (as such) High Risk Changes, and likely to contain serious bugs. Please report problems to the current@ mailing list ASAP, ideally with simple test cases, and optionally, packet traces. MFC after: 3 months
# 464fcfbc	28-Feb-2006	Andre Oppermann <andre@FreeBSD.org>	Rework TCP window scaling (RFC1323) to properly scale the send window right from the beginning and partly clean up the differences in handling between SYN_SENT and SYN_RCVD (syncache). Further changes to this code to come. This is a first incremental step to a general overhaul and streamlining of the TCP code. PR: kern/15095 PR: kern/92690 (partly) Reviewed by: qingli (and tested with ANVL) Sponsored by: TCP/IP Optimization Fundraise 2005
# eaf80179	16-Feb-2006	Andre Oppermann <andre@FreeBSD.org>	Have TCP Inflight disable itself if the RTT is below a certain threshold. Inflight doesn't make sense on a LAN as it has trouble figuring out the maximal bandwidth because of the coarse tick granularity. The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold in milliseconds below which inflight will disengage. It defaults to 10ms. Tested by: Joao Barros <joao.barros-at-gmail.com>, Rich Murphey <rich-at-whiteoaklabs.com> Sponsored by: TCP/IP Optimization Fundraise 2005
# 5a53ca16	27-Jun-2005	Paul Saab <ps@FreeBSD.org>	- Postpone SACK option processing until after PAWS checks. SACK option processing is now done in the ACK processing case. - Merge tcp_sack_option() and tcp_del_sackholes() into a new function called tcp_sack_doack(). - Test (SEG.ACK < SND.MAX) before processing the ACK. Submitted by: Noritoshi Demizu Reveiewed by: Mohan Srinivasan, Raja Mukerji Approved by: re
# 9d17a7a6	04-Jun-2005	Paul Saab <ps@FreeBSD.org>	Changes to tcp_sack_option() that - Walks the scoreboard backwards from the tail to reduce the number of comparisons for each sack option received. - Introduce functions to add/remove sack scoreboard elements, making the code more readable. Submitted by: Noritoshi Demizu Reviewed by: Raja Mukerji, Mohan Srinivasan
# 808f11b7	25-May-2005	Paul Saab <ps@FreeBSD.org>	This is conform with the terminology in M.Mathis and J.Mahdavi, "Forward Acknowledgement: Refining TCP Congestion Control" SIGCOMM'96, August 1996. Submitted by: Noritoshi Demizu, Raja Mukerji
# 2cdbfa66	20-May-2005	Paul Saab <ps@FreeBSD.org>	Replace t_force with a t_flag (TF_FORCEDATA). Submitted by: Raja Mukerji. Reviewed by: Mohan, Silby, Andre Opperman.
# 0077b016	11-May-2005	Paul Saab <ps@FreeBSD.org>	When looking for the next hole to retransmit from the scoreboard, or to compute the total retransmitted bytes in this sack recovery episode, the scoreboard is traversed. While in sack recovery, this traversal occurs on every call to tcp_output(), every dupack and every partial ack. The scoreboard could potentially get quite large, making this traversal expensive. This change optimizes this by storing hints (for the next hole to retransmit and the total retransmitted bytes in this sack recovery episode) reducing the complexity to find these values from O(n) to constant time. The debug code that sanity checks the hints against the computed value will be removed eventually. Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.
# a6235da6	21-Apr-2005	Paul Saab <ps@FreeBSD.org>	- Make the sack scoreboard logic use the TAILQ macros. This improves code readability and facilitates some anticipated optimizations in tcp_sack_option(). - Remove tcp_print_holes() and TCP_SACK_DEBUG. Submitted by: Raja Mukerji. Reviewed by: Mohan Srinivasan, Noritoshi Demizu.
# 1600372b	20-Apr-2005	Andre Oppermann <andre@FreeBSD.org>	Ignore ICMP Source Quench messages for TCP sessions. Source Quench is ineffective, depreciated and can be abused to degrade the performance of active TCP sessions if spoofed. Replace a bogus call to tcp_quench() in tcp_output() with the direct equivalent tcpcb variable assignment. Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.1 MFC after: 3 days
# 25e6f9ed	14-Apr-2005	Paul Saab <ps@FreeBSD.org>	Fix for a TCP SACK bug where more than (win/2) bytes could have been in flight in SACK recovery. Found by: Noritoshi Demizu Submitted by: Mohan Srinivasan <mohans at yahoo-inc dot com> Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp> Raja Mukerji <raja at moselle dot com>
# e891d82b	09-Mar-2005	Paul Saab <ps@FreeBSD.org>	Add limits on the number of elements in the sack scoreboard both per-connection and globally. This eliminates potential DoS attacks where SACK scoreboard elements tie up too much memory. Submitted by: Raja Mukerji (raja at moselle dot com). Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com).
# 7643c37c	17-Feb-2005	Paul Saab <ps@FreeBSD.org>	Remove 2 (SACK) fields from the tcpcb. These are only used by a function that is called from tcp_input(), so they oughta be passed on the stack instead of stuck in the tcpcb. Submitted by: Mohan Srinivasan
# 9945c0e2	14-Feb-2005	Maxim Konovalov <maxim@FreeBSD.org>	o Add handling of an IPv4-mapped IPv6 address. o Use SYSCTL_IN() macro instead of direct call of copyin(9). Submitted by: ume o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where most of tcp sysctls live. o There are net.inet[6].tcp[6].getcred sysctls already, no needs in a separate struct tcp_ident_mapping. Suggested by: ume
# 212a79b0	06-Feb-2005	Maxim Konovalov <maxim@FreeBSD.org>	o Implement net.inet.tcp.drop sysctl and userland part, tcpdrop(8) utility: The tcpdrop command drops the TCP connection specified by the local address laddr, port lport and the foreign address faddr, port fport. Obtained from: OpenBSD Reviewed by: rwatson (locking), ru (man page), -current MFC after: 1 month
# c398230b	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for license, minor formatting changes
# db0aae38	22-Dec-2004	Robert Watson <rwatson@FreeBSD.org>	Remove the now unused tcp_canceltimers() function. tcpcb timers are now stopped as part of tcp_discardcb(). MFC after: 2 weeks
# c94c54e4	02-Nov-2004	Andre Oppermann <andre@FreeBSD.org>	Remove RFC1644 T/TCP support from the TCP side of the network stack. A complete rationale and discussion is given in this message and the resulting discussion: http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706 Note that this commit removes only the functional part of T/TCP from the tcp_* related functions in the kernel. Other features introduced with RFC1644 are left intact (socket layer changes, sendmsg(2) on connection oriented protocols) and are meant to be reused by a simpler and less intrusive reimplemention of the previous T/TCP functionality. Discussed on: -arch
# cd109b0d	22-Oct-2004	Andre Oppermann <andre@FreeBSD.org>	Shave 40 unused bytes from struct tcpcb.
# a55db2b6	05-Oct-2004	Paul Saab <ps@FreeBSD.org>	- Estimate the amount of data in flight in sack recovery and use it to control the packets injected while in sack recovery (for both retransmissions and new data). - Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c. - Add a new sysctl (net.inet.tcp.sack.initburst) that controls the number of sack retransmissions done upon initiation of sack recovery. Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>
# a4f757cd	16-Aug-2004	Robert Watson <rwatson@FreeBSD.org>	White space cleanup for netinet before branch: - Trailing tab/space cleanup - Remove spurious spaces between or before tabs This change avoids touching files that Andre likely has in his working set for PFIL hooks changes for IPFW/DUMMYNET. Approved by: re (scottl) Submitted by: Xin LI <delphij@frontfree.net>
# 969860f3	17-Jul-2004	David Malone <dwmalone@FreeBSD.org>	The tcp syncache code was leaving the IPv6 flowlabel uninitialised for the SYN\|ACK packet and then letting in6_pcbconnect set the flowlabel later. Arange for the syncache/syncookie code to set and recall the flow label so that the flowlabel used for the SYN\|ACK is consistent. This is done by using some of the cookie (when tcp cookies are enabeled) and by stashing the flowlabel in syncache. Tested and Discovered by: Orla McGann <orly@cnri.dit.ie> Approved by: ume, silby MFC after: 1 month
# 37332f04	24-Jun-2004	Bruce M Simpson <bms@FreeBSD.org>	Whitespace.
# 6d90faf3	23-Jun-2004	Paul Saab <ps@FreeBSD.org>	Add support for TCP Selective Acknowledgements. The work for this originated on RELENG_4 and was ported to -CURRENT. The scoreboarding code was obtained from OpenBSD, and many of the remaining changes were inspired by OpenBSD, but not taken directly from there. You can enable/disable sack using net.inet.tcp.do_sack. You can also limit the number of sack holes that all senders can have in the scoreboard with net.inet.tcp.sackhole_limit. Reviewed by: gnn Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
# 80dd2a81	25-Apr-2004	Mike Silbersack <silby@FreeBSD.org>	Tighten up reset handling in order to make reset attacks as difficult as possible while maintaining compatibility with the widest range of TCP stacks. The algorithm is as follows: --- For connections in the ESTABLISHED state, only resets with sequence numbers exactly matching last_ack_sent will cause a reset, all other segments will be silently dropped. For connections in all other states, a reset anywhere in the window will cause the connection to be reset. All other segments will be silently dropped. --- The necessity of accepting all in-window resets was discovered by jayanth and jlemon, both of whom have seen TCP stacks that will respond to FIN-ACK packets with resets not meeting the strict last_ack_sent check. Idea by: Darren Reed Reviewed by: truckman, jlemon, others(?)
# de9f59f8	20-Apr-2004	Bruce M Simpson <bms@FreeBSD.org>	Fix a typo in a comment.
# c1537ef0	20-Apr-2004	Mike Silbersack <silby@FreeBSD.org>	Enhance our RFC1948 implementation to perform better in some pathlogical TIME_WAIT recycling cases I was able to generate with http testing tools. In short, as the old algorithm relied on ticks to create the time offset component of an ISN, two connections with the exact same host, port pair that were generated between timer ticks would have the exact same sequence number. As a result, the second connection would fail to pass the TIME_WAIT check on the server side, and the SYN would never be acknowledged. I've "fixed" this by adding random positive increments to the time component between clock ticks so that ISNs will always be increasing, no matter how quickly the port is recycled. Except in such contrived benchmarking situations, this problem should never come up in normal usage... until networks get faster. No MFC planned, 4.x is missing other optimizations that are needed to even create the situation in which such quick port recycling will occur.
# f36cfd49	07-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson
# a7b6a14a	28-Feb-2004	Robert Watson <rwatson@FreeBSD.org>	Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These were needed by the MAC Framework until inpcbs gained labels. Submitted by: sam
# 0613995b	25-Feb-2004	Bruce Evans <bde@FreeBSD.org>	Fixed namespace pollution in rev.1.74. Implementation of the syncache increased <netinet/tcp_var>'s already large set of prerequisites, and this was handled badly. Just don't declare the complete syncache struct unless <netinet/pcb.h> is included before <netinet/tcp_var.h>. Approved by: jlemon (years ago, for a more invasive fix)
# a545b1dc	25-Feb-2004	Bruce Evans <bde@FreeBSD.org>	Don't use the negatively-opaque type uma_zone_t or be chummy with <vm/uma.h>'s idempotency indentifier or its misspelling.
# 12e2e970	24-Feb-2004	Andre Oppermann <andre@FreeBSD.org>	Convert the tcp segment reassembly queue to UMA and limit the maximum amount of segments it will hold. The following tuneables and sysctls control the behaviour of the tcp segment reassembly queue: net.inet.tcp.reass.maxsegments (loader tuneable) specifies the maximum number of segments all tcp reassemly queues can hold (defaults to 1/16 of nmbclusters). net.inet.tcp.reass.maxqlen specifies the maximum number of segments any individual tcp session queue can hold (defaults to 48). net.inet.tcp.reass.cursegments (readonly) counts the number of segments currently in all reassembly queues. net.inet.tcp.reass.overflows (readonly) counts how often either the global or local queue limit has been reached. Tested by: bms, silby Reviewed by: bms, silby
# 265ed012	13-Feb-2004	Bruce M Simpson <bms@FreeBSD.org>	Brucification. Submitted by: bde
# b30190b5	12-Feb-2004	Bruce M Simpson <bms@FreeBSD.org>	Update the prototype for tcpsignature_apply() to reflect the spelling of the types used by m_apply()'s callback function, f, as documented in mbuf(9). Noticed by: njl
# 1cfd4b53	10-Feb-2004	Bruce M Simpson <bms@FreeBSD.org>	Initial import of RFC 2385 (TCP-MD5) digest support. This is the first of two commits; bringing in the kernel support first. This can be enabled by compiling a kernel with options TCP_SIGNATURE and FAST_IPSEC. For the uninitiated, this is a TCP option which provides for a means of authenticating TCP sessions which came into being before IPSEC. It is still relevant today, however, as it is used by many commercial router vendors, particularly with BGP, and as such has become a requirement for interconnect at many major Internet points of presence. Several parts of the TCP and IP headers, including the segment payload, are digested with MD5, including a shared secret. The PF_KEY interface is used to manage the secrets using security associations in the SADB. There is a limitation here in that as there is no way to map a TCP flow per-port back to an SPI without polluting tcpcb or using the SPD; the code to do the latter is unstable at this time. Therefore this code only supports per-host keying granularity. Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6), TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective users of this feature, this will not pose any problem. This implementation is output-only; that is, the option is honoured when responding to a host initiating a TCP session, but no effort is made [yet] to authenticate inbound traffic. This is, however, sufficient to interwork with Cisco equipment. Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with local patches. Patches for tcpdump to validate TCP-MD5 sessions are also available from me upon request. Sponsored by: sentex.net
# 53369ac9	08-Jan-2004	Andre Oppermann <andre@FreeBSD.org>	Limiters and sanity checks for TCP MSS (maximum segement size) resource exhaustion attacks. For network link optimization TCP can adjust its MSS and thus packet size according to the observed path MTU. This is done dynamically based on feedback from the remote host and network components along the packet path. This information can be abused to pretend an extremely low path MTU. The resource exhaustion works in two ways: o during tcp connection setup the advertized local MSS is exchanged between the endpoints. The remote endpoint can set this arbitrarily low (except for a minimum MTU of 64 octets enforced in the BSD code). When the local host is sending data it is forced to send many small IP packets instead of a large one. For example instead of the normal TCP payload size of 1448 it forces TCP payload size of 12 (MTU 64) and thus we have a 120 times increase in workload and packets. On fast links this quickly saturates the local CPU and may also hit pps processing limites of network components along the path. This type of attack is particularly effective for servers where the attacker can download large files (WWW and FTP). We mitigate it by enforcing a minimum MTU settable by sysctl net.inet.tcp.minmss defaulting to 256 octets. o the local host is reveiving data on a TCP connection from the remote host. The local host has no control over the packet size the remote host is sending. The remote host may chose to do what is described in the first attack and send the data in packets with an TCP payload of at least one byte. For each packet the tcp_input() function will be entered, the packet is processed and a sowakeup() is signalled to the connected process. For example an attack with 2 Mbit/s gives 4716 packets per second and the same amount of sowakeup()s to the process (and context switches). This type of attack is particularly effective for servers where the attacker can upload large amounts of data. Normally this is the case with WWW server where large POSTs can be made. We mitigate this by calculating the average MSS payload per second. If it goes below 'net.inet.tcp.minmss' and the pps rate is above 'net.inet.tcp.minmssoverload' defaulting to 1000 this particular TCP connection is resetted and dropped. MITRE CVE: CAN-2004-0002 Reviewed by: sam (mentor) MFC after: 1 day
# 97d8d152	20-Nov-2003	Andre Oppermann <andre@FreeBSD.org>	Introduce tcp_hostcache and remove the tcp specific metrics from the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)
# 4bd4fa3f	02-Nov-2003	Mike Silbersack <silby@FreeBSD.org>	Add an additional check to the tcp_twrecycleable function; I had previously only considered the send sequence space. Unfortunately, some OSes (windows) still use a random positive increments scheme for their syn-ack ISNs, so I must consider receive sequence space as well. The value of 250000 bytes / second for Microsoft's ISN rate of increase was determined by testing with an XP machine.
# 96af9ea5	01-Nov-2003	Mike Silbersack <silby@FreeBSD.org>	- Add a new function tcp_twrecycleable, which tells us if the ISN which we will generate for a given ip/port tuple has advanced far enough for the time_wait socket in question to be safely recycled. - Have in_pcblookup_local use tcp_twrecycleable to determine if time_Wait sockets which are hogging local ports can be safely freed. This change preserves proper TIME_WAIT behavior under normal circumstances while allowing for safe and fast recycling whenever ephemeral port space is scarce.
# 9d11646d	15-Jul-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Unify the "send high" and "recover" variables as specified in the lastest rev of the spec. Use an explicit flag for Fast Recovery. [1] Fix bug with exiting Fast Recovery on a retransmit timeout diagnosed by Lu Guohan. [2] Reviewed by: Thomas Henderson <thomas.r.henderson@boeing.com> Reported and tested by: Lu Guohan <lguohan00@mails.tsinghua.edu.cn> [2] Approved by: Thomas Henderson <thomas.r.henderson@boeing.com>, Sally Floyd <floyd@acm.org> [1]
# 430c6354	06-May-2003	Robert Watson <rwatson@FreeBSD.org>	Correct a bug introduced with reduced TCP state handling; make sure that the MAC label on TCP responses during TIMEWAIT is properly set from either the socket (if available), or the mbuf that it's responding to. Unfortunately, this is made somewhat difficult by the TCP code, as tcp_twstart() calls tcp_twrespond() after discarding the socket but without a reference to the mbuf that causes the "response". Passing both the socket and the mbuf works arounds this--eventually it might be good to make sure the mbuf always gets passed in in "response" scenarios but working through this provided to complicate things too much. Approved by: re (scottl) Reviewed by: hsu Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 48d2549c	01-Apr-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Observe conservation of packets when entering Fast Recovery while doing Limited Transmit. Only artificially inflate the congestion window by 1 segment instead of the usual 3 to take into account the 2 already sent by Limited Transmit. Approved in principle by: Mark Allman <mallman@grc.nasa.gov>, Hari Balakrishnan <hari@nms.lcs.mit.edu>, Sally Floyd <floyd@icir.org>
# 607b0b0c	08-Mar-2003	Jonathan Lemon <jlemon@FreeBSD.org>	Remove a panic(); if the zone allocator can't provide more timewait structures, reuse the oldest one. Also move the expiry timer from a per-structure callout to the tcp slow timer. Sponsored by: DARPA, NAI Labs
# 340c35de	19-Feb-2003	Jonathan Lemon <jlemon@FreeBSD.org>	Add a TCP TIMEWAIT state which uses less space than a fullblown TCP control block. Allow the socket and tcpcb structures to be freed earlier than inpcb. Update code to understand an inp w/o a socket. Reviewed by: hsu, silby, jayanth Sponsored by: DARPA, NAI Labs
# 79909384	19-Feb-2003	Jonathan Lemon <jlemon@FreeBSD.org>	Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so the routine does not require a tcpcb to operate. Since we no longer keep template mbufs around, move pseudo checksum out of this routine, and merge it with the length update. Sponsored by: DARPA, NAI Labs
# cb942153	13-Jan-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Fix NewReno. Reviewed by: Tom Henderson <thomas.r.henderson@boeing.com>
# 1fcc99b5	17-Aug-2002	Matthew Dillon <dillon@FreeBSD.org>	Implement TCP bandwidth delay product window limiting, similar to (but not meant to duplicate) TCP/Vegas. Add four sysctls and default the implementation to 'off'. net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off) net.inet.tcp.inflight_debug debugging (defaults to 1=on) net.inet.tcp.inflight_min minimum window limit net.inet.tcp.inflight_max maximum window limit MFC after: 1 week
# d65bf08a	19-Jul-2002	Matthew Dillon <dillon@FreeBSD.org>	Add the tcps_sndrexmitbad statistic, keep track of late acks that caused unnecessary retransmissions.
# 3ce144ea	14-Jun-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Notify functions can destroy the pcb, so they have to return an indication of whether this happenned so the calling function knows whether or not to unlock the pcb. Submitted by: Jennifer Yang (yangjihui@yahoo.com) Bug reported by: Sid Carter (sidcarter@symonds.net)
# eb5afeba	13-Jun-2002	Mike Silbersack <silby@FreeBSD.org>	Re-commit w/fix: Ensure that the syn cache's syn-ack packets contain the same ip_tos, ip_ttl, and DF bits as all other tcp packets. PR: 39141 MFC after: 2 weeks This time, make sure that ipv4 specific code (aka all of the above) is only run in the ipv4 case.
# 70d2b170	13-Jun-2002	Mike Silbersack <silby@FreeBSD.org>	Back out ip_tos/ip_ttl/DF "fix", it just panic'd my box. :) Pointy-hat to: silby
# 21c3b2fc	13-Jun-2002	Mike Silbersack <silby@FreeBSD.org>	Ensure that the syn cache's syn-ack packets contain the same ip_tos, ip_ttl, and DF bits as all other tcp packets. PR: 39141 MFC after: 2 weeks
# f76fcf6d	10-Jun-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Lock up inpcb. Submitted by: Jennifer Yang <yangjihui@yahoo.com>
# 4d77a549	19-Mar-2002	Alfred Perlstein <alfred@FreeBSD.org>	Remove __P.
# 262c1c1a	02-Dec-2001	Matthew Dillon <dillon@FreeBSD.org>	Fix a bug with transmitter restart after receiving a 0 window. The receiver was not sending an immediate ack with delayed acks turned on when the input buffer is drained, preventing the transmitter from restarting immediately. Propogate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and is a good idea anyway). Some cleanup. Identify additonal issues in comments. MFC after: 1 day
# be2ac88c	21-Nov-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Introduce a syncache, which enables FreeBSD to withstand a SYN flood DoS in an improved fashion over the existing code. Reviewed by: silby (in a previous iteration) Sponsored by: DARPA, NAI Labs
# c24d5dae	05-Oct-2001	Jayanth Vijayaraghavan <jayanth@FreeBSD.org>	Add a flag TF_LASTIDLE, that forces a previously idle connection to send all its data, especially when the data is less than one MSS. This fixes an issue where the stack was delaying the sending of data, eventhough there was enough window to send all the data and the sending of data was emptying the socket buffer. Problem found by Yoshihiro Tsuchiya (tsuchiya@flab.fujitsu.co.jp) Submitted by: Jayanth Vijayaraghavan
# f0ffb944	03-Sep-2001	Julian Elischer <julian@FreeBSD.org>	Patches from Keiichi SHIMA <keiichi@iij.ad.jp> to make ip use the standard protosw structure again. Obtained from: Well, KAME I guess.
# b0e3ad75	21-Aug-2001	Mike Silbersack <silby@FreeBSD.org>	Much delayed but now present: RFC 1948 style sequence numbers In order to ensure security and functionality, RFC 1948 style initial sequence number generation has been implemented. Barring any major crypographic breakthroughs, this algorithm should be unbreakable. In addition, the problems with TIME_WAIT recycling which affect our currently used algorithm are not present. Reviewed by: jesper
# 2d610a50	07-Jul-2001	Mike Silbersack <silby@FreeBSD.org>	Temporary feature: Runtime tuneable tcp initial sequence number generation scheme. Users may now select between the currently used OpenBSD algorithm and the older random positive increment method. While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT handling; this is causing trouble for an increasing number of folks. To switch between generation schemes, one sets the sysctl net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments, 1 = the OpenBSD algorithm. 1 is still the default. Once a secure _and_ compatible algorithm is implemented, this sysctl will be removed. Reviewed by: jlemon Tested by: numerous subscribers of -net
# 08517d53	22-Jun-2001	Mike Silbersack <silby@FreeBSD.org>	Eliminate the allocation of a tcp template structure for each connection. The information contained in a tcptemp can be reconstructed from a tcpcb when needed. Previously, tcp templates required the allocation of one mbuf per connection. On large systems, this change should free up a large number of mbufs. Reviewed by: bmilekic, jlemon, ru MFC after: 2 weeks
# f0a04f3f	17-Apr-2001	Kris Kennaway <kris@FreeBSD.org>	Randomize the TCP initial sequence numbers more thoroughly. Obtained from: OpenBSD Reviewed by: jesper, peter, -developers
# c693a045	26-Feb-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Remove in_pcbnotify and use in_pcblookup_hash to find the cb directly. For TCP, verify that the sequence number in the ICMP packet falls within the tcp receive window before performing any actions indicated by the icmp packet. Clean up some layering violations (access to tcp internals from in_pcb)
# 694a9ff9	25-Feb-2001	Jesper Skriver <jesper@FreeBSD.org>	Remove tcp_drop_all_states, which is unneeded after jlemon removed it from tcp_subr.c in rev 1.92
# 90fcbbd6	18-Feb-2001	Poul-Henning Kamp <phk@FreeBSD.org>	Remove unneeded loop increment in src/sys/netinet/in_pcb.c:in_pcbnotify Add new PRC_UNREACH_ADMIN_PROHIB in sys/sys/protosw.h Remove condition on TCP in src/sys/netinet/ip_icmp.c:icmp_input In src/sys/netinet/ip_icmp.c:icmp_input set code = PRC_UNREACH_ADMIN_PROHIB or PRC_UNREACH_HOST for all unreachables except ICMP_UNREACH_NEEDFRAG Rename sysctl icmp_admin_prohib_like_rst to icmp_unreach_like_rst to reflect the fact that we also react on ICMP unreachables that are not administrative prohibited. Also update the comments to reflect this. In sys/netinet/tcp_subr.c:tcp_ctlinput add code to treat PRC_UNREACH_ADMIN_PROHIB and PRC_UNREACH_HOST different. PR: 23986 Submitted by: Jesper Skriver <jesper@skriver.dk>
# 442fad67	24-Dec-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Update the "icmp_admin_prohib_like_rst" code to check the tcp-window and to be configurable with respect to acting only in SYN or in all TCP states. PR: 23665 Submitted by: Jesper Skriver <jesper@skriver.dk>
# b11d7a4a	16-Dec-2000	Poul-Henning Kamp <phk@FreeBSD.org>	We currently does not react to ICMP administratively prohibited messages send by routers when they deny our traffic, this causes a timeout when trying to connect to TCP ports/services on a remote host, which is blocked by routers or firewalls. rfc1122 (Requirements for Internet Hosts) section 3.2.2.1 actually requi re that we treat such a message for a TCP session, that we treat it like if we had recieved a RST. quote begin. A Destination Unreachable message that is received MUST be reported to the transport layer. The transport layer SHOULD use the information appropriately; for example, see Sections 4.1.3.3, 4.2.3.9, and 4.2.4 below. A transport protocol that has its own mechanism for notifying the sender that a port is unreachable (e.g., TCP, which sends RST segments) MUST nevertheless accept an ICMP Port Unreachable for the same purpose. quote end. I've written a small extension that implement this, it also create a sysctl "net.inet.tcp.icmp_admin_prohib_like_rst" to control if this new behaviour is activated. When it's activated (set to 1) we'll treat a ICMP administratively prohibited message (icmp type 3 code 9, 10 and 13) for a TCP sessions, as if we recived a TCP RST, but only if the TCP session is in SYN_SENT state. The reason for only reacting when in SYN_SENT state, is that this will solve the problem, and at the same time minimize the risk of this being abused. I suggest that we enable this new behaviour by default, but it would be a change of current behaviour, so if people prefer to leave it disabled by default, at least for now, this would be ok for me, the attached diff actually have the sysctl set to 0 by default. PR: 23086 Submitted by: Jesper Skriver <jesper@skriver.dk>
# e7f32693	21-Jul-2000	Jayanth Vijayaraghavan <jayanth@FreeBSD.org>	When a connection is being dropped due to a listen queue overflow, delete the cloned route that is associated with the connection. This does not exhaust the routing table memory when the system is under a SYN flood attack. The route entry is not deleted if there is any prior information cached in it. Reviewed by: Peter Wemm,asmodai
# 571214d4	18-Jul-2000	Sheldon Hearn <sheldonh@FreeBSD.org>	Fix a comment which was broken in rev 1.36. PR: 19947 Submitted by: Tetsuya Isaki <isaki@net.ipc.hiroshima-u.ac.jp>
# e3975643	25-May-2000	Jake Burkholder <jake@FreeBSD.org>	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others
# 740a1973	23-May-2000	Jake Burkholder <jake@FreeBSD.org>	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd
# 46f58482	05-May-2000	Jonathan Lemon <jlemon@FreeBSD.org>	Implement TCP NewReno, as documented in RFC 2582. This allows better recovery for multiple packet losses in a single window. The algorithm can be toggled via the sysctl net.inet.tcp.newreno, which defaults to "on". Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>
# fb59c426	09-Jan-2000	Yoshinobu Inoue <shin@FreeBSD.org>	tcp updates to support IPv6. also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 664a31e4	28-Dec-1999	Peter Wemm <peter@FreeBSD.org>	Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL" is an application space macro and the applications are supposed to be free to use it as they please (but cannot). This is consistant with the other BSD's who made this change quite some time ago. More commits to come.
# 6a800098	22-Dec-1999	Yoshinobu Inoue <shin@FreeBSD.org>	IPSEC support in the kernel. pr_input() routines prototype is also changed to support IPSEC and IPV6 chained protocol headers. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 76429de4	05-Nov-1999	Yoshinobu Inoue <shin@FreeBSD.org>	KAME related header files additions and merges. (only those which don't affect c source files so much) Reviewed by: cvs-committers Obtained from: KAME project
# 9b8b58e0	30-Aug-1999	Jonathan Lemon <jlemon@FreeBSD.org>	Restructure TCP timeout handling: - eliminate the fast/slow timeout lists for TCP and instead use a callout entry for each timer. - increase the TCP timer granularity to HZ - implement "bad retransmit" recovery, as presented in "On Estimating End-to-End Network Path Properties", by Allman and Paxson. Submitted by: jlemon, wollmann
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# ce02431f	16-Feb-1999	Doug Rabson <dfr@FreeBSD.org>	* Change sysctl from using linker_set to construct its tree using SLISTs. This makes it possible to change the sysctl tree at runtime. * Change KLD to find and register any sysctl nodes contained in the loaded file and to unregister them when the file is unloaded. Reviewed by: Archie Cobbs <archie@whistle.com>, Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)
# b0acefa8	20-Jan-1999	Bill Fenner <fenner@FreeBSD.org>	Add a flag, passed to pru_send routines, PRUS_MORETOCOME. This flag means that there is more data to be put into the socket buffer. Use it in TCP to reduce the interaction between mbuf sizes and the Nagle algorithm. Based on: "Justin C. Walker" <justin@apple.com>'s description of Apple's fix for this problem.
# 6effc713	24-Aug-1998	Doug Rabson <dfr@FreeBSD.org>	Re-implement tcp and ip fragment reassembly to not store pointers in the ip header which can't work on alpha since pointers are too big. Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
# cfe8b629	22-Aug-1998	Garrett Wollman <wollman@FreeBSD.org>	Yow! Completely change the way socket options are handled, eliminating another specialized mbuf type in the process. Also clean up some of the cruft surrounding IPFW, multicast routing, RSVP, and other ill-explored corners.
# 07a4df4f	13-Jul-1998	Bruce Evans <bde@FreeBSD.org>	Declare tcp_seq and tcp_cc as fixed-size types. Half fixed type mismatches exposed by this (the prototype for tcp_respond() didn't match the function definition lexically, and still depends on a gcc feature to match if ints have more than 32 bits).
# a910fdcb	27-Jun-1998	John Hay <jhay@FreeBSD.org>	Only make struct xtcpcb visable if _NETINET_IN_PCB_H_ and _SYS_SOCKETVAR_H_ are defined. Reviewed by: bde
# 98271db4	15-May-1998	Garrett Wollman <wollman@FreeBSD.org>	Convert socket structures to be type-stable and add a version number. Define a parameter which indicates the maximum number of sockets in a system, and use this to size the zone allocators used for sockets and for certain PCBs. Convert PF_LOCAL PCB structures to be type-stable and add a version number. Define an external format for infomation about socket structures and use it in several places. Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through sysctl(3) without blocking network interrupts for an unreasonable length of time. This probably still has some bugs and/or race conditions, but it seems to work well enough on my machines. It is now possible for `netstat' to get almost all of its information via the sysctl(3) interface rather than reading kmem (changes to follow).
# 552b7df4	24-Apr-1998	David Greenman <dg@FreeBSD.org>	Ensure that TCP_REXMTVAL doesn't return a value less than t_rttmin. This is believed to have been broken with the Brakmo/Peterson srtt calculation changes. The result of this bug is that TCP connections could time out extremely quickly (in 12 seconds). Also backed out jdp's partial fix for this problem in rev 1.17 of tcp_timer.c as it is obsoleted by this commit. Bug was pointed out by Kevin Lehey <kml@roller.nas.nasa.gov>. PR: 6068
# 8e5db87c	06-Apr-1998	Poul-Henning Kamp <phk@FreeBSD.org>	Remove the last traces of TUBA. Inspired by: PR kern/3317
# f498eeee	25-Feb-1998	David Greenman <dg@FreeBSD.org>	Changes to support the addition of a new sysctl variable: net.inet.tcp.delack_enabled Which defaults to 1 and can be set to 0 to disable TCP delayed-ack processing (i.e. all acks are immediate).
# c3229e05	27-Jan-1998	David Greenman <dg@FreeBSD.org>	Improved connection establishment performance by doing local port lookups via a hashed port list. In the new scheme, in_pcblookup() goes away and is replaced by a new routine, in_pcblookup_local() for doing the local port check. Note that this implementation is space inefficient in that the PCB struct is now too large to fit into 128 bytes. I might deal with this in the future by using the new zone allocator, but I wanted these changes to be extensively tested in their current form first. Also: 1) Fixed off-by-one errors in the port lookup loops in in_pcbbind(). 2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash() to do the initialial hash insertion. 3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability. 4) Added a new routine, in_pcbremlists() to remove the PCB from the various hash lists. 5) Added/deleted comments where appropriate. 6) Removed unnecessary splnet() locking. In general, the PCB functions should be called at splnet()...there are unfortunately a few exceptions, however. 7) Reorganized a few structs for better cache line behavior. 8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in the future, however. These changes have been tested on wcarchive for more than a month. In tests done here, connection establishment overhead is reduced by more than 50 times, thus getting rid of one of the major networking scalability problems. Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult. WARNING: Anything that knows about inpcb and tcpcb structs will have to be recompiled; at the very least, this includes netstat(1).
# a29f300e	27-Apr-1997	Garrett Wollman <wollman@FreeBSD.org>	The long-awaited mega-massive-network-code- cleanup. Part I. This commit includes the following changes: 1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility glue for them is deleted, and the kernel will panic on boot if any are compiled in. 2) Certain protocol entry points are modified to take a process structure, so they they can easily tell whether or not it is possible to sleep, and also to access credentials. 3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt() call. Protocols should use the process pointer they are now passed. 4) The PF_LOCAL and PF_ROUTE families have been updated to use the new style, as has the `raw' skeleton family. 5) PF_LOCAL sockets now obey the process's umask when creating a socket in the filesystem. As a result, LINT is now broken. I'm hoping that some enterprising hacker with a bit more time will either make the broken bits work (should be easy for netipx) or dike them out.
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 561c2ad3	13-Sep-1996	Paul Traina <pst@FreeBSD.org>	Move TCPCTL_KEEPINIT to end of MIB list (sigh)
# 7b40aa32	13-Sep-1996	Paul Traina <pst@FreeBSD.org>	Make the misnamed tcp initial keepalive timer value (which is really the time, in seconds, that state for non-established TCP sessions stays about) a sysctl modifyable variable. [part 1 of two commits, I just realized I can't play with the indices as I was typing this commit message.]
# 2c37256e	11-Jul-1996	Garrett Wollman <wollman@FreeBSD.org>	Modify the kernel to use the new pr_usrreqs interface rather than the old pr_usrreq mechanism which was poorly designed and error-prone. This commit renames pr_usrreq to pr_ousrreq so that old code which depended on it would break in an obvious manner. This commit also implements the new interface for TCP, although the old function is left as an example (#ifdef'ed out). This commit ALSO fixes a longstanding bug in the TCP timer processing (introduced by davidg on 1995/04/12) which caused timer processing on a TCB to always stop after a single timer had expired (because it misinterpreted the return value from tcp_usrreq() to indicate that the TCB had been deleted). Finally, some code related to polling has been deleted from if.c because it is not relevant t -current and doesn't look at all like my current code.
# 6da5712b	05-Jun-1996	Garrett Wollman <wollman@FreeBSD.org>	Correct formula for TCP RTO calculation. Also try to do a better job in filling in a new PCB's rttvar (but this is not the last word on the subject). And get rid of `#ifdef RTV_RTT', it's been true for four years now...
# a2352fc1	26-Apr-1996	Garrett Wollman <wollman@FreeBSD.org>	Delete #ifdef notdef blocks containing old method of srtt calculation. Requested by: davidg
# 233e8c18	22-Mar-1996	Garrett Wollman <wollman@FreeBSD.org>	A number of performance-reducing flaws fixed based on comments from Larry Peterson &co. at Arizona: - Header prediction for ACKs did not exclude Fast Retransmit/Recovery. - srtt calculation tended to get ``stuck'' and could never decrease when below 8. It still can't, but the scaling factors are adjusted so that this artifact does not cause as bad an effect on the RTO value as it used to. The paper also points out the incr/8 error that has been long since fixed, and the problems with ACKing frequency resulting from the use of options which I suspect to be fixed already as well (as part of the T/TCP work). Obtained from: Brakmo & Peterson, ``Performance Problems in BSD4.4 TCP''
# 3420f4ab	27-Feb-1996	Bruce Evans <bde@FreeBSD.org>	Spell tcp_listendrop consistently so that tcp_input.c and netstat compile.
# 1347f5b8	26-Feb-1996	Guido van Rooij <guido@FreeBSD.org>	Add a counter for the number of times the listen queue was overflowed to the tcpstat structure. (netstat -s) Reviewed by: wollman Obtained from: Steves, TCP/IP Ill. vol.3, page 189
# 6c5e9bbd	30-Jan-1996	Mike Pritchard <mpp@FreeBSD.org>	Fix a bunch of spelling errors in the comment fields of a bunch of system include files.
# 34da5848	19-Jan-1996	Peter Wemm <peter@FreeBSD.org>	remove tcp_lastport - it has not been used for quite a while (at least since the hashed pcb's I think).
# 858c045f	28-Dec-1995	David Greenman <dg@FreeBSD.org>	Remove some bogus externs.
# b62d102c	15-Dec-1995	Bruce Evans <bde@FreeBSD.org>	Uniformized pr_ctlinput protosw functions. The third arg is now `void *' instead of caddr_t and it isn't optional (it never was). Most of the netipx (and netns) pr_ctlinput functions abuse the second arg instead of using the third arg but fixing this is beyond the scope of this round of changes.
# b7a44e34	05-Dec-1995	Garrett Wollman <wollman@FreeBSD.org>	Path MTU Discovery is now standard.
# 0312fbe9	14-Nov-1995	Poul-Henning Kamp <phk@FreeBSD.org>	New style sysctl & staticize alot of stuff.
# a45d2726	03-Nov-1995	Andras Olah <olah@FreeBSD.org>	Fix a logical error in T/TCP: when we actively open a connection, we have to decide whether to send a CC or CCnew option in our SYN segment depending on the contents of our TAO cache. This decision has to be made once when the connection starts. The earlier code delayed this decision until the segment was assembled in tcp_output() and retransmitted SYN segments could have different CC options. Reviewed by: Richard Stevens, davidg, wollman
# 3d1f141b	16-Oct-1995	Garrett Wollman <wollman@FreeBSD.org>	The ability to administratively change the MTU of an interface presents a few new wrinkles for MTU discovery which tcp_output() had better be prepared to handle. ip_output() is also modified to do something helpful in this case, since it has already calculated the information we need.
# 3abc79d2	12-Oct-1995	Garrett Wollman <wollman@FreeBSD.org>	The additional checks involving sequence numbers in MTU discovery resends turned out not to be necessary; simply watching for MTU decreases (which we already did) automagically eliminates all the cases we were trying to protect against.
# 143d7a54	10-Oct-1995	Garrett Wollman <wollman@FreeBSD.org>	More MTU discovery: avoid over-retransmission if route changes in the middle of a fully-open window. Also, keep track of how many retransmits we do as a result of MTU discovery. This may actually do more work than necessary, but it's an unusual condition... Suggested by: Janey Hoe <janey@lcs.mit.edu>
# 6bb9a8e7	04-Oct-1995	Garrett Wollman <wollman@FreeBSD.org>	Make a whole bunch of PCB variables ints rather than shorts. There appear to be no ill effects, and so far as Iknow none of the variables in question depend on 16-bit wraparound behavior. (The sizes are in many cases relics from when a PCB had to fit inside a 128-byte mbuf. PCBs are no longer stored in that way, and the old structure would not have fit, either.)
# 4dc45a5f	22-Sep-1995	Peter Wemm <peter@FreeBSD.org>	Remove duplicate definition for tcps_persistdrop, as added by davidg some time ago. I left in Garrett's one, because his was in the 4.4-Lite-2 location, making any diffs just that little bit smaller. I presume this choice means that netstat needs to be recompiled before "netstat -s" will give a meaningful answer on tcp stats.
# 6c52bc46	21-Sep-1995	Garrett Wollman <wollman@FreeBSD.org>	Merge with 4.4-Lite-2. This just adds a couple of tcpstat entries which we don't currently set, but might in the future.
# efe4b0eb	21-Sep-1995	Garrett Wollman <wollman@FreeBSD.org>	Second try: get 4.4-Lite-2 into the source tree. The conflicts don't matter because none of our working source files are on the CSRG branch any more. Obtained from: 4.4BSD-Lite-2
# cc0964fb	29-Jul-1995	David Greenman <dg@FreeBSD.org>	Add connection drop capability for persist timeouts. Reviewed by: Andras Olah Obtained from: 4.4BSD-lite2 via W. Richard Stevens
# dd224982	10-Jul-1995	Garrett Wollman <wollman@FreeBSD.org>	tcp_input.c - keep track of how many times a route contained a cached rtt or ssthresh that we were able to use tcp_var.h - declare tcpstat entries for above; declare tcp_{send,recv}space in_rmx.c - fill in the MTU and pipe sizes with the defaults TCP would have used anyway in the absence of values here
# fc978271	29-Jun-1995	Garrett Wollman <wollman@FreeBSD.org>	Keep track of the number of samples through the srtt filter so that we know better when to cache values in the route, rather than relying on a heuristic involving sequence numbers that broke when tcp_sendspace was increased to 16k.
# 91677201	19-Jun-1995	Garrett Wollman <wollman@FreeBSD.org>	Now that we've gone to all sorts of effort to allow TCP to cache some of its connection parameters, we want to keep statistics on how often this actually happens to see whether there is any work that needs to be done in TCP itself. Suggested by: John Wroclawski <jtw@lcs.mit.edu>
# 15bd2b43	08-Apr-1995	David Greenman <dg@FreeBSD.org>	Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash, and in_pcblookuphash.
# b5e8ce9f	16-Mar-1995	Bruce Evans <bde@FreeBSD.org>	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# 41f82abe	15-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Transaction TCP support now standard. Hack away!
# f2ea20e6	15-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Add lots of useful MIB variables and a few not-so-useful ones for completeness.
# 2f96f1f4	13-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Get rid of some unneeded #ifdef TTCP lines. Also, get rid of some bogus commons declared in header files.
# a0292f23	09-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Merge Transaction TCP, courtesy of Andras Olah <olah@cs.utwente.nl> and Bob Braden <braden@isi.edu>. NB: This has not had David's TCP ACK hack re-integrated. It is not clear what the correct solution to this problem is, if any. If a better solution doesn't pop up in response to this message, I'll put David's code back in (or he's welcome to do so himself).
# 512ff5ea	08-Feb-1995	David Greenman <dg@FreeBSD.org>	Fix/#ifdef prototype for tcp_mss...apparantly overlooked by Garrett.
# eb6ad696	08-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Merge in T/TCP TCP header file changes.
# 707f139e	20-Aug-1994	Paul Richards <paul@FreeBSD.org>	Made idempotent. Submitted by: Paul
# 3c4dd356	02-Aug-1994	David Greenman <dg@FreeBSD.org>	Added $Id$
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources