Cross Reference: /freebsd-current/sys/netinet/tcp

History log of /freebsd-current/sys/netinet/tcp_stacks/rack.c
Revision	Date	Author	Comments
# ea916b64	18-May-2024	Randall Stewart <rrs@FreeBSD.org>	Remove TCP_SAD optional code now that the sack filter performs this function. With the commit of D44903 we no longer need the SAD option. Instead all stacks that use the sack filter inherit its protection against sack-attack. Reviewed by: tuexen@ Differential Revision:https://reviews.freebsd.org/D45216
# 2f923a0c	11-May-2024	Michael Tuexen <tuexen@FreeBSD.org>	tcp rack: improve handling of front states When the RACK stack wants to send a FIN, but still has outstanding or unsent data, it sends a challenge ack. Don't do this when the TCP endpoint is still in the front states, since it does not make sense. Reviewed by: rrs MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D45122
# fce03f85	05-May-2024	Randall Stewart <rrs@FreeBSD.org>	TCP can be subject to Sack Attacks lets fix this issue. There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like: ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704) This first sack looks fine but then the attacker sends ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705) ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707) ... These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries. To combat this we introduce three things. when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing. Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks. The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway. The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks). After this set of changes is in rack can drop its SAD detection completely Reviewed by:tuexen@, rscheff@ Differential Revision: <https://reviews.freebsd.org/D44903>
# 1941914d	18-Apr-2024	Michael Tuexen <tuexen@FreeBSD.org>	tcp rack: improve BBR_LOG_CWND event Fix a typo, which resulted in missing r_ctl.gate_to_fs in the BBLog event. Reported by: Coverity Scan CID: 1540024 Reviewed by: rrs, rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D44648
# c9cd686b	18-Apr-2024	Michael Tuexen <tuexen@FreeBSD.org>	tcp: drop data received after a FIN has been processed RFC 9293 describes the handling of data in the CLOSE-WAIT, CLOSING, LAST-ACK, and TIME-WAIT states: This should not occur since a FIN has been received from the remote side. Ignore the segment text. Therefore, implement this handling. Reviewed by: rrs, rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D44746
# d902c8f5	06-Apr-2024	Michael Tuexen <tuexen@FreeBSD.org>	tcp rack: fix memory corruption When in rack_output() jumping to the label out, don't write errno into the log buffer, since the pointer is not initialized. Reported by: Coverity Scan CID: 1523773 Reviewed by: rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D44647
# 7df0ef5f	05-Apr-2024	Michael Tuexen <tuexen@FreeBSD.org>	tcp rack: fix sending In rack_output(), idle is used as a boolean variable. So don't use it as an int and don't clear it afterwards. This avoids setting idle to false, when it is not intended. Reported by: olivier Reviewed by: rrs, rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D44610
# dd7b86e2	18-Mar-2024	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove IS_FASTOPEN() macro The macro is more obfuscating than helping as it just checks a single flag of t_flags. All other t_flags bits are checked without a macro. A bigger problem was that declaration of the macro in tcp_var.h depended on a kernel option. It is a bad practice to create such definitions in installable headers. Reviewed by: rscheff, tuexen, kib Differential Revision: https://reviews.freebsd.org/D44362
# e18b97bd	12-Mar-2024	Randall Stewart <rrs@FreeBSD.org>	Update to bring the rack stack with all its fixes in. This brings the rack stack up to the current level used at NF. Many fixes and improvements have been added. I also add in a fix to BBR to deal with the changes that have been in hpts for a while i.e. only one call no matter if mbuf queue or tcp_output. It basically does little except BBlogs and is a placemark for future work on doing path capacity measurements. With a bit of a struggle with git I finally got rack_pcm.c into place (apologies for not noticing this error). The LINT kernel is running on my box now .. sigh. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision:https://reviews.freebsd.org/D43986
# c112243f	11-Mar-2024	Brooks Davis <brooks@FreeBSD.org>	Revert "Update to bring the rack stack with all its fixes in." This commit was incomplete and breaks LINT kernels. The tree has been broken for 8+ hours. This reverts commit f6d489f402c320f1a6eaa473491a0b8c3878113e.
# f6d489f4	11-Mar-2024	Randall Stewart <rrs@FreeBSD.org>	Update to bring the rack stack with all its fixes in. This brings the rack stack up to the current level used at NF. Many fixes and improvements have been added. I also add in a fix to BBR to deal with the changes that have been in hpts for a while i.e. only one call no matter if mbuf queue or tcp_output. Note there is a new file that I can't figure out how to get in rack_pcm.c It basically does little except BBlogs and is a placemark for future work on doing path capacity measurements. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision:https://reviews.freebsd.org/D43986
# 2f4e46df	15-Feb-2024	Michael Tuexen <tuexen@FreeBSD.org>	RACK, BBR: handle EACCES like EPERM for IP output handling The FreeBSD TCP base stack handles them also the same way. In case of packet filters dropping packets in the output path, this avoids retranmitting the dropped packet every 10ms or so. Reviewed by: rscheff MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D43773
# 32a6df57	08-Feb-2024	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: calculate ssthresh on RTO according to RFC5681 per RFC5681, only adjust ssthresh on the initital retransmission timeout. Since RTO often happens during loss recovery, while cwnd no longer tracks all data in flight, calculcate pipe properly. Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D43768
# f7d5900a	28-Dec-2023	John Baldwin <jhb@FreeBSD.org>	sys: Style fix for M_EXT \| M_EXTPG Add a space around the \| operator in places testing for either M_EXT or M_EXTPG. Reviewed by: imp, glebius Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D43216
# 7b0b448b	27-Dec-2023	Gordon Bergling <gbe@FreeBSD.org>	tcp_stacks: Fix two typos in a source code comments - s/recieved/received/ MFC after: 3 days
# d2ef52ef	04-Dec-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp/hpts: make stacks responsible for clearing themselves out HPTS There already is the tfb_tcp_timer_stop_all method that is supposed to stop all time events associated with a given tcpcb by given stack. Some time ago it was doing actual callout_stop(). Today bbr/rack just mark their internal state as inactive in their tfb_tcp_timer_stop_all methods, but tcpcb stays in HPTS wheel and potentially called in from HPTS. Change the methods to also call tcp_hpts_remove(). Note: I'm not sure if internal flag is still relevant once we are out of HPTS wheel. Call the method when connection goes into TCP_CLOSED state, instead of calling it later when tcpcb is freed. Also call it when we switch between stacks. Reviewed by: tuexen, rrs Differential Revision: https://reviews.freebsd.org/D42857
# 2b3a7746	04-Dec-2023	Gleb Smirnoff <glebius@FreeBSD.org>	hpts: make stacks responsible for tcp_hpts_init() Those stacks that use HPTS should care about init, not generic code. Reviewed by: imp, tuexen, rrs Differential Revision: https://reviews.freebsd.org/D42856
# b10ae5a9	05-Nov-2023	Michael Tuexen <tuexen@FreeBSD.org>	tcp rack: remove references to rb trees The references should have been removed in https://cgit.freebsd.org/src/commit/?id=030434acaf4631c4e205f8bccedcc7f845cbfcbf Reviewed by: rscheff, zlei MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D42386
# 8818f0f1	04-Oct-2023	Randall Stewart <rrs@FreeBSD.org>	TCP: Fix a rack bug that skyzall found which results in a crash. So when we call the fast_rsm retransmit path, we should always move snd_nxt back up to snd_max. In fact during ack-processing if snd_nxt falls behind it should be moved up there as well. Otherwise what can happen is we have an incorrect mark on snd_nxt and incorrectly calculate the offset when we go through the front path (which is what skzyall was able to do) then when we go to clean up the send the offset is all wrong and we crash. Special thanks to Gleb for pointing out the problem and the email that had the reproducer so I could find the issue. Reported-by: syzbot+f5061a372f74f021ec02@syzkaller.appspotmail.com Sponsored by: Netflix Inc
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s]__FBSDID$"\$FreeBSD\$"$;?\s*\n/
# ab65c64b	26-Jul-2023	Michael Tuexen <tuexen@FreeBSD.org>	tcp: fix handling of <RST,ACK> segments in SYN-RCVD for RACK and BBR This deals with TCP endpoints in the SYN-RCVD state coming from the SYN-SENT state. Reviewed by: rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D41203
# 02b885b0	21-Jun-2023	Michael Tuexen <tuexen@FreeBSD.org>	tcp: fix TCP MD5 computation for the BBR and RACK stack PR: 253096 Reviewed by: cc, rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D40597
# e022f2b0	09-Jun-2023	Randall Stewart <rrs@FreeBSD.org>	tcp: Rack fixes and misc updates So over the past few weeks we have found several bugs and updated hybrid pacing to have more data in the low-level logging. We have also moved more of the BBlogs to "verbose" mode so that we don't generate a lot of the debug data unless you put verbose/debug on. There were a couple of notable bugs, one being the incorrect passing of percentage for reduction to timely and the other the incorrect use of 20% timely Beta instead of 80%. This also expands a simply idea to be able to pace a cwnd (fillcw) as an alternate pacing mechanism combining that with timely reduction/increase. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision:https://reviews.freebsd.org/D40391
# 43b117f8	06-Jun-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: make the maximum number of retransmissions tunable per VNET Both Windows (TcpMaxDataRetransmissions) and Linux (tcp_retries2) allow to restrict the maximum number of consecutive timer based retransmissions. Add that same capability on a per-VNet basis to FreeBSD. Reviewed By: cc, tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D40424
# d66540e8	05-Jun-2023	Michael Tuexen <tuexen@FreeBSD.org>	tcp: improve sending of TTL/hoplimit and DSCP Ensure that a user specified value of TTL/hoplimit and DSCP is used when sending packets. Reviewed by: cc, rscheff MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D40423
# 57a3a161	24-May-2023	Randall Stewart <rrs@FreeBSD.org>	tcp: request tracking is not http specific. This change is a name change only. TCP Request tracking can track sendfile and even non-sendfile requests. The names however in the current code use http, and they should not. The feature is not http specific. Lets change the name so they more properly reflect whats going on. This also fixes conflicts with http_req which caused application pain. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision:https://reviews.freebsd.org/D40229
# c3c20de3	25-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: move HPTS/LRO flags out of inpcb to tcpcb These flags are TCP specific. While here, make also several LRO internal functions to pass tcpcb pointer instead of inpcb one. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39698
# c2a69e84	25-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_hpts: move HPTS related fields from inpcb to tcpcb This makes inpcb lighter and allows future cache line optimizations of tcpcb. The reason why HPTS originally used inpcb is the compressed TIME-WAIT state (see 0d7445193ab), that used to free a tcpcb, while the associated connection is still on the HPTS ring. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39697
# 4e8a20a7	19-Apr-2023	Randall Stewart <rrs@FreeBSD.org>	tcp: rack the request level logging is a bit too noisy when doing point logging. When doing request level BB logging the hybrid_bw_log() does not have proper screening to minimize logging when point level logging is in use. Lets fix it properly so you have to have the proper knobs set to get the more noisy logging. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision:https://reviews.freebsd.org/D39699
# 7a842346	19-Apr-2023	Randall Stewart <rrs@FreeBSD.org>	tcp: Rack can crash with the new non-TSO fix.. Turns out the location of the check to see if we can do output is in the wrong place. We need to jump off to the compressed acks before handling that case since th is NULL in the compressed ack case which is handled differently anyway. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision:https://reviews.freebsd.org/D39690
# 2ad584c5	17-Apr-2023	Randall Stewart <rrs@FreeBSD.org>	tcp: Inconsistent use of hpts_calling flag Gleb has noticed there were some inconsistency's in the way the inp_hpts_calls flag was being used. One such inconsistency results in a bug when we can't allocate enough sendmap entries to entertain a call to rack_output().. basically a timer won't get started like it should. Also in cleaning this up I find that the "no_output" side of input needs to be adjusted to make sure we don't try to re-pace too quickly outside the hpts assurance of 250useconds. Another thing here is we end up with duplicate calls to tcp_output() which we should not. If packets go from hpts for processing the input side of tcp will call the output side of tcp on the last packet if it is needed. This means that when that occurs a second call to tcp_output would be made that is not needed and if pacing is going on may be harmful. Lets fix all this and explicitly state the contract that hpts is making with transports that care about the flag. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc Differential Revision:https://reviews.freebsd.org/D39653
# a540cdca	17-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_hpts: use queue(9) STAILQ for the input queue Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39574
# 3cc7b667	14-Apr-2023	Randall Stewart <rrs@FreeBSD.org>	tcp: stack unloading crash in rack and bbr Its possible to induce a crash in either rack or bbr. This would be done if the rack stack were say the default and bbr was being used by a connection. If the bbr stack is then unloaded and it was active, we will trigger a MPASS assert in tcp_hpts since the new stack (default rack) would start a timer, and the old stack (bbr) would have the inp already in hpts. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision:https://reviews.freebsd.org/D39576
# 9903bf34	14-Apr-2023	Randall Stewart <rrs@FreeBSD.org>	tcp: rack pacing has some caveats that need to be obeyed when LRO is missing n further non-LRO testing I found a case where rack is supposed to be waking up but it is not now. In this special case it sets the flag rc_ack_can_sendout_data. When that is set we should not prohibit output. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision:https://reviews.freebsd.org/D39565
# a2b33c9a	10-Apr-2023	Randall Stewart <rrs@FreeBSD.org>	tcp: Rack - in the absence of LRO fixed rate pacing (loopback or interfaces with no LRO) does not work correctly. Rack is capable of fixed rate or dynamic rate pacing. Both of these can get mixed up when LRO is not available. This is because LRO will hold off waking up the tcp connection to processing the inbound packets until the pacing timer is up. Without LRO the pacing only sort-of works. Sometimes we pace correctly, other times not so much. This set of changes will make it so pacing works properly in the absence of LRO. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision:https://reviews.freebsd.org/D39494
# e2224617	10-Apr-2023	John Baldwin <jhb@FreeBSD.org>	rack: mask and tclass are only used for INET6. This fixes the LINT-NOINET6 build.
# 66fbc19f	07-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: pass tcpcb in the tfb_tcp_ctloutput() method instead of inpcb Just matches rest of the KPI. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39435
# 35bc0bcc	07-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: reduce argument list to functions that pass a segment The socket argument is superfluous, as a tcpcb always has one and only one socket. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39434
# 945f9a7c	07-Apr-2023	Randall Stewart <rrs@FreeBSD.org>	tcp: misc cleanup of options for rack as well as socket option logging. Both BBR and Rack have the ability to log socket options, which is currently disabled. Rack has an experimental SaD (Sack Attack Detection) algorithm that should be made available. Also there is a t_maxpeak_rate that needs to be removed (its un-used). Reviewed by: tuexen, cc Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D39427
# 84b42df8	04-Apr-2023	Gleb Smirnoff <glebius@FreeBSD.org>	rack: fix build on powerpc
# 030434ac	04-Apr-2023	Randall Stewart <rrs@FreeBSD.org>	Update rack to the latest code used at NF. There have been many changes to rack over the last couple of years, including: a) Ability when switching stacks to have one stack query another. b) Internal use of micro-second timers instead of ticks. c) Many changes to pacing in forms of 1) Improvements to Dynamic Goodput Pacing (DGP) 2) Improvements to fixed rate paciing 3) A new feature called hybrid pacing where the requestor can get a combination of DGP and fixed rate pacing with deadlines for delivery that can dynamically speed things up. d) All kinds of bugs found during extensive testing and use of the rack stack for streaming video and in fact all data transferred by NF Reviewed by: glebius, gallatin, tuexen Sponsored By: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D39402
# 73ee5756	31-Mar-2023	Randall Stewart <rrs@FreeBSD.org>	Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features. So stack switching as always been a bit of a issue. We currently use a break before make setup which means that if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider for sure). Once all this lands I will then update rack to begin using all these new features. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D39210
# 69c7c811	16-Mar-2023	Randall Stewart <rrs@FreeBSD.org>	Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities. The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints we need to move to using the new inline functions. This adds them and moves rack to now use the tcp_tracepoints. Reviewed by: tuexen, gallatin Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D38831
# c0e4090e	08-Feb-2023	Andrew Gallatin <gallatin@FreeBSD.org>	ktls: Accurately track if ifnet ktls is enabled This allows us to avoid spurious calls to ktls_disable_ifnet() When we implemented ifnet kTLSe, we set a flag in the tx socket buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that now, or in the past, ifnet ktls was active on a socket. Later, I added code to switch ifnet ktls sessions to software in the case of lossy TCP connections that have a high retransmit rate. Because TCP was using SB_TLS_IFNET to know if it needed to do math to calculate the retransmit ratio and potentially call into ktls_disable_ifnet(), it was doing unneeded work long after a session was moved to software. This patch carefully tracks whether or not ifnet ktls is still enabled on a TCP connection. Because the inp is now embedded in the tcpcb, and because TCP is the most frequent accessor of this state, it made sense to move this from the socket buffer flags to the tcpcb. Because we now need reliable access to the tcbcb, we take a ref on the inp when creating a tx ktls session. While here, I noticed that rack/bbr were incorrectly implementing tfb_hwtls_change(), and applying the change to all pending sends, when it should apply only to future sends. This change reduces spurious calls to ktls_disable_ifnet() by 95% or so in a Netflix CDN environment. Reviewed by: markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D38380
# 18b83b62	26-Jan-2023	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: reduce the size of t_rttupdated in tcpcb During tcp session start, various mechanisms need to track a few initial RTTs before becoming active. Prevent overflows of the corresponding tracking counter and reduce the size of tcpcb simultaneously. Reviewed By: #transport, tuexen, guest-ccui Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D21117
# 73e994a9	19-Jan-2023	Gordon Bergling <gbe@FreeBSD.org>	extra_tcp_stacks: Fix a common typo in source code comments - s/orginal/original/ MFC after: 3 days
# 432a398d	10-Jan-2023	Gordon Bergling <gbe@FreeBSD.org>	tcp_rack(4): Fix a typo in a source code comment - s/postion/position/ MFC after: 3 days
# 8ea41829	10-Jan-2023	Andrew Gallatin <gallatin@FreeBSD.org>	tcp: Build RACK and BBR stacks as a part of LINT When RACK and BBR were added to the kernel, they were put behind 'WITH_EXTRA_TCP_STACKS=1'. Unfortunately that was never added to any NOTES file, so RACK & BBR were not compiled with the various LINT-NOINET, LINT-NOINET6, and LINT-NOIP kernels. This lead to the stacks sometimes being broken. This change: - Fixes RACK so that it compiles with the various LINT-NO* kernels - Adds WITH_EXTRA_TCP_STACKS=1 to all NOTES kernels so that RACK and BBR are compile tested regularly Sponsored by: Netflix Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D37903
# 2e2a1c31	14-Dec-2022	Randall Stewart <rrs@FreeBSD.org>	Opps take out a stray left behind printf that was for debugging.. Sorry.
# e2e088ae	14-Dec-2022	Randall Stewart <rrs@FreeBSD.org>	Rack cannot be loaded without cc_newreno compiled into the kernel. Right now rack will fail to load due to its hack in accessing symbol names in cc_newreno. This was fine when newreno was always compiled into the kernel but now ... not so much. Instead lets fix up rack to use the socket option queries to get the information it wants and set the parameters. We also fix the CC parameter so they are always settable. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D37622
# eaabc937	14-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: retire TCPDEBUG This subsystem is superseded by modern debugging facilities, e.g. DTrace probes and TCP black box logging. We intentionally leave SO_DEBUG in place, as many utilities may set it on a socket. Also the tcp::debug DTrace probes look at this flag on a socket. Reviewed by: gnn, tuexen Discussed with: rscheff, rrs, jtl Differential revision: https://reviews.freebsd.org/D37694
# e6fc01f6	14-Dec-2022	Mateusz Guzik <mjg@FreeBSD.org>	tcp: whack the stale declaration of rack_timer_stop Sponsored by: Rubicon Communications, LLC ("Netgate")
# 446ccdd0	07-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: use single locked callout per tcpcb for the TCP timers Use only one callout structure per tcpcb that is responsible for handling all five TCP timeouts. Use locked version of callout, of course. The callout function tcp_timer_enter() chooses soonest timer and executes it with lock held. Unless the timer reports that the tcpcb has been freed, the callout is rescheduled for next soonest timer, if there is any. With single callout per tcpcb on connection teardown we should be able to fully stop the callout and immediately free it, avoiding use of callout_async_drain(). There is one gotcha here: callout_stop() can actually touch our memory when a rare race condition happens. See comment above tcp_timer_stop(). Synchronous stop of the callout makes tcp_discardcb() the single entry point for tcpcb destructor, merging the tcp_freecb() to the end of the function. While here, also remove lots of lingering checks in the beginning of TCP timer functions. With a locked callout they are unnecessary. While here, clean unused parts of timer KPI for the pluggable TCP stacks. While here, remove TCPDEBUG from tcp_timer.c, as this allows for more simplification of TCP timers. The TCPDEBUG is scheduled for removal. Move the DTrace probes in timers to the beginning of a function, where a tcpcb is always existing. Discussed with: rrs, tuexen, rscheff (the TCP part of the diff) Reviewed by: hselasky, kib, mav (the callout part) Differential revision: https://reviews.freebsd.org/D37321
# 918fa422	07-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove tcp_timer_suspend() It was a temporary code added together with RACK to fight against TCP timer races.
# e68b3792	07-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: embed inpcb into tcpcb For the TCP protocol inpcb storage specify allocation size that would provide space to most of the data a TCP connection needs, embedding into struct tcpcb several structures, that previously were allocated separately. The most import one is the inpcb itself. With embedding we can provide strong guarantee that with a valid TCP inpcb the tcpcb is always valid and vice versa. Also we reduce number of allocs/frees per connection. The embedded inpcb is placed in the beginning of the struct tcpcb, since in_pcballoc() requires that. However, later we may want to move it around for cache line efficiency, and this can be done with a little effort. The new intotcpcb() macro is ready for such move. The congestion algorithm data, the TCP timers and osd(9) data are also embedded into tcpcb, and temprorary struct tcpcb_mem goes away. There was no extra allocation here, but we went through extra pointer every time we accessed this data. One interesting side effect is that now TCP data is allocated from SMR-protected zone. Potentially this allows the TCP stacks or other TCP related modules to utilize that for their own synchronization. Large part of the change was done with sed script: s/tp->ccv->/tp->t_ccv./g s/tp->ccv/\&tp->t_ccv/g s/tp->cc_algo/tp->t_cc/g s/tp->t_timers->tt_/tp->tt_/g s/CCV$ccv, osd$/\&CCV(ccv, t_osd)/g Dependency side effect is that code that needs to know struct tcpcb should also know struct inpcb, that added several <netinet/in_pcb.h>. Differential revision: https://reviews.freebsd.org/D37127
# bd4f9866	16-Nov-2022	Michael Tuexen <tuexen@FreeBSD.org>	tcp: remove unused t_rttbest No functional change intended. Reviewed by: rscheff@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D37401
# 9eb0e832	08-Nov-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: provide macros to access inpcb and socket from a tcpcb There should be no functional changes with this commit. Reviewed by: rscheff Differential revision: https://reviews.freebsd.org/D37123
# b1258b76	06-Nov-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: add conservative d.cep accounting algorithm Accurate ECN asks to conservatively estimate, when the ACE counter may have wrapped due to a single ACK covering a larger number of segments. This is described in Annex A.2 of the accurate-ecn draft. Event: IETF 115 Hackathon Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D37281
# f504685a	31-Oct-2022	Gleb Smirnoff <glebius@FreeBSD.org>	rack/bbr: put back assertion that connection is not in TIME-WAIT The assertion was incorrectly removed in 0d7445193ab. The leak of a TIME-WAIT state into tfb_do_segment_nounlock method was fixed in 31bc602ff81. The TIME-WAIT connections are processed by the main tcp_input() always.
# 83c1ec92	20-Oct-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: ECN preparations for ECN++, AccECN (tcp_respond) tcp_respond is another function to build a tcp control packet quickly. With ECN++ and AccECN, both the IP ECN header, and the TCP ECN flags are supposed to reflect the correct state. Also ensure that on receiving multiple ECN SYN-ACKs, the responses triggered will reflect the latest state. Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D36973
# 53af6903	06-Oct-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove INP_TIMEWAIT flag Mechanically cleanup INP_TIMEWAIT from the kernel sources. After 0d7445193ab, this commit shall not cause any functional changes. Note: this flag was very often checked together with INP_DROPPED. If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we will be able to remove most of this checks and turn them to assertions. Some of them can be turned into assertions right now, but that should be carefully done on a case by case basis. Differential revision: https://reviews.freebsd.org/D36400
# 0d744519	06-Oct-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove tcptw, the compressed timewait state structure The memory savings the tcptw brought back in 2003 (see 340c35de6a2) no longer justify the complexity required to maintain it. For longer explanation please check out the email [1]. Surpisingly through almost 20 years the TCP stack functionality of handling the TIME_WAIT state with a normal tcpcb did not bitrot. The existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state, which is confirmed by the packetdrill tcp-testsuite [2]. This change just removes tcptw and leaves INP_TIMEWAIT. The flag will be removed in a separate commit. This makes it easier to review and possibly debug the changes. [1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html [2] https://github.com/freebsd-net/tcp-testsuite Differential revision: https://reviews.freebsd.org/D36398
# 2220b66f	03-Oct-2022	Konstantin Belousov <kib@FreeBSD.org>	Add mbuf_tstmp2timeval() Reviewed by: hselasky, jkim, rscheff Sponsored by: NVIDIA networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D36870
# e5049a17	26-Sep-2022	Randall Stewart <rrs@FreeBSD.org>	TCP rack does not work properly with cubic. Right now if you use rack with cubic (the new default cc) you will have improper results. This is because rack uses different variables than the base stack (or bbr) and thus tcp_compute_pipe() always returns so that cubic will choose a 30% backoff not the 50% backoff it should when it is newreno compatibility mode. The fix is to allow a stack (rack) to override its own compute_pipe. Reviewed by: tuexen, rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D36711
# 493105c2	21-Sep-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: fix simultaneous open and refine e80062a2d43 - The soisconnected() call on transition from SYN_RCVD to ESTABLISHED is also necessary for a half-synchronized connection. Fix that just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED. - Provide a comment that explains at what conditions the call to soisconnected() is necessary. - Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN. - Extend the change to the BBR and RACK stacks. Note: the interaction between the accept_filter(9) and the socket layer is not fully consistent, yet. For most accept filters this call to soisconnected() will not move the connection from the incomplete queue to the complete. The move would happen only when the filter has received the desired data, and soisconnected() would be called once again from sorwakeup(). Ideally, we should mark socket as connected only there, and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the simultaneous open case. However, this doesn't yet work. Reviewed by: rscheff, tuexen, rrs Differential revision: https://reviews.freebsd.org/D36641
# 81560c55	09-Sep-2022	Randall Stewart <rrs@FreeBSD.org>	TCP: Rack ends up sending all that is outstanding every timeout. In doing some testing for a different problem, I have found rack retransmitting all outstanding data every time a timeout occurs. The outstanding is sent 1ms apart between each packet, and then the timeout runs off again. This causes extra retransmissions when we should be waiting for an ack after sending the very first segment. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D36494
# fa52f9dc	03-Sep-2022	Gordon Bergling <gbe@FreeBSD.org>	tcp_rack: Fix two typos in source code comments - s/overriden/overridden/ MFC after: 3 days
# 4012ef77	31-Aug-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Functional implementation of Accurate ECN The AccECN handshake and TCP header flags are supported, no support yet for the AccECN option. This minimalistic implementation is sufficient to support DCTCP while dramatically cutting the number of ACKs, and provide ECN response from the receiver to the CC modules. Reviewed By: #transport, #manpages, rrs, pauamma Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D21011
# 62ce18fc	23-Aug-2022	Randall Stewart <rrs@FreeBSD.org>	tcp: Rack rwnd collapse. Currently when the peer collapses its rwnd, we mark packets to be retransmitted and use the must_retran flags like we do when a PMTU collapses to retransmit the collapsed packets. However this causes a problem with some middle boxes that play with the rwnd to control flow. As soon as the rwnd increases we start resending which may be not even a rtt.. and in fact the peer may have gotten the packets. Which means we gratuitously retransmit packets we should not. The fix here is to make sure that a rack time has passed before retransmitting the packets. This makes sure that the rwnd collapse was real and the packets do need retransmission. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D35166
# 57cdd13d	14-Aug-2022	Dimitry Andric <dim@FreeBSD.org>	Suppress unused variable warning in tcp_stacks's rack.c With clang 15, the following -Werror warning is produced: sys/netinet/tcp_stacks/rack.c:17405:12: error: variable 'outstanding' set but not used [-Werror,-Wunused-but-set-variable] uint32_t outstanding; ^ The 'outstanding' variable was used later in the rack_output() function, but refactoring in 35c7bb340788f removed the usage. To avoid too much code churn, mark the variable unused to supress the warning. MFC after: 3 days
# e967183c	14-Aug-2022	Dimitry Andric <dim@FreeBSD.org>	Fix unused variable warning in tcp_stacks's rack.c With clang 15, the following -Werror warning is produced: sys/netinet/tcp_stacks/rack.c:16148:6: error: variable 'cnt_thru' set but not used [-Werror,-Wunused-but-set-variable] int cnt_thru = 1; ^ The 'cnt_thru' variable is only used when TCP_ACCOUNTING is defined. Ensure it is only declared and set in that case. MFC after: 3 days
# 1abc27dd	01-Aug-2022	Michael Tuexen <tuexen@FreeBSD.org>	tcp rack: simplify computation of rsm start and end While there, also fix the setting of the SYN related flag. Reviewed by: rrs MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D35862
# 5b741298	19-Jul-2022	Michael Tuexen <tuexen@FreeBSD.org>	tcp rack: fix switching to RACK when FIN has been sent Fix the rack sendmap entry in case a FIN has been sent when the stack is switched over to RACK. Reported by: syzbot+dd55e316428419e9354b@syzkaller.appspotmail.com Reviewed by: rrs MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D35731
# 74703901	04-Jul-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: use a TCP flag to check if connection has been close(2)d The flag SS_NOFDREF is a private flag of the socket layer. It also is supposed to be read with SOCK_LOCK(), which we don't own here. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D35663
# 32a01b2b	04-Jun-2022	Gordon Bergling <gbe@FreeBSD.org>	rack: Fix a common typo in comments and a sysctl description - s/multipler/multiplier/ MFC after: 3 days
# c93db892	04-Jun-2022	Gordon Bergling <gbe@FreeBSD.org>	rack: Fix a typo in a source code comment - s/enought/enough/ MFC after: 3 days
# 21b923c3	04-Jun-2022	Gordon Bergling <gbe@FreeBSD.org>	tcp_rack: Fix two typos in sysctl descriptions - s/higest/highest/ MFC after: 3 days
# 4581cffb	12-May-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: fix build, convert missed sbreserve_locked() calls Fixes: 4328318445ae
# 04831efd	10-May-2022	Randall Stewart <rrs@FreeBSD.org>	tcp: Rack idle reduce not working. Rack converted to micro-seconds quite some time ago, but in testing we have found a miss in that work. The idle reduce time is still based in ticks, so it must be converted to microseconds before any comparisons else you will likely not do idle reduce. Reviewed by: tuexen, thj Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D35066
# 6edfc10c	14-Apr-2022	Randall Stewart <rrs@FreeBSD.org>	tcp: adding a functionality to define "trace points" so that BB logging can be enabled at specific events. This commit will add a new concept to rack, tracepoints. A tracepoint is a defined point inserted into the code (3 are included in this initial patch) that allows a developer to insert a point that might be of interest. The developer numbers the point in the tcp_rack.h file and then can use sysctl to enable that (or all) trace points. A limit is also given to how many BB logged connections will turn on so that a box is not overrun by BB logging. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D34898
# 86fa80f3	12-Apr-2022	John Baldwin <jhb@FreeBSD.org>	rack: Remove unused variable.
# addb2c65	09-Apr-2022	Gordon Bergling <gbe@FreeBSD.org>	tcp_rack: Fix a typo in a source code comment - s/possiblity/possibility/ MFC after: 3 days
# 36814092	09-Apr-2022	Gordon Bergling <gbe@FreeBSD.org>	tcp_rack: Fix a few typos in sysctl descriptions and comments - s/postion/position/ - s/postions/positions/ - s/repostion/reposition/ MFC after: 5 days
# ee1a08b8	01-Apr-2022	Randall Stewart <rrs@FreeBSD.org>	rack may end up with a stuck connectin fi the rwnd is colapsed on sent data. There is a case where rack will get stuck when it has outstanding data and the peer collapses the rwnd down to 0. This leaves the session hung if the rwnd update is not received. You can test this with the packet drill script below. Without this fix it will be stuck and hang. With it we retransmit everything. This also fixes the mtu retransmit case so we don't go into recovery when the mtu is changed to a smaller value. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D34573
# 75fdc440	07-Feb-2022	Gordon Bergling <gbe@FreeBSD.org>	extra_tcp_stacks: Fix two typos in source code comments - s/recusive/recursive/ MFC after: 3 days
# 2ff07d92	25-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Restore correct ECT marking behavior on SACK retransmissions While coalescing all ECN-related code into new common source files, the flag to deal with SACK retransmissions was skipped. This leads to non-compliant ECT-marking of SACK retransmissions, as well as the premature sending of other TCP ECN flags (CWR). Reviewed By: rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34376
# a43b0aca	23-Feb-2022	Randall Stewart <rrs@FreeBSD.org>	tcp: Push bit failure to set in fastpath Recently changes were made to the tcp stack to use a macro/function to set tcp flags. In the process the PUSH bit setting in the fastpath of rack was broken. This fixes that as well as cleans up a warning that is occurring when you don't have INVARIANT on (inp used in KASSERT). We can use the tcp test suite to find this bug the test plan shows the script that fails due to the missing push bit Reviewed by: rscheff, tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D34332
# cc41c174	09-Feb-2022	Randall Stewart <rrs@FreeBSD.org>	opps my patch lost the removal of the tlp_threshold counter increments
# 8d64b4b4	09-Feb-2022	Randall Stewart <rrs@FreeBSD.org>	cleanup of rack variables. During a recent deep dive into all the variables so I could discover why stack switching caused larger retransmits I examined every variable in rack. In the process I found quite a few bits that were not used and needed cleanup. This update pulls out all the unused pieces from rack. Note there are no functional changes here, just the removal of unused variables and a bit of spacing clean up. Reviewed by: Michael Tuexen, Richard Scheffenegger Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D34205
# ab001fcd	06-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Apply tcp flags after ECN processing in rack_fast_output() Missed to move the tcp_set_flags() past ECN processing in rack_fast_output() earlier. Reviewed By: rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34180
# a9696510	07-Feb-2022	Randall Stewart <rrs@FreeBSD.org>	tcp: Add hystart++ to our cubic implementation. As promised to the transport call on 11/4/22 here is an implementation of hystart++ for cubic. It also cleans up the tcp_congestion function to have a better name. Common variables are moved into the general cc.h structure so that both cubic and newreno can use them for hystart++ Reviewed by: Michael Tuexen, Richard Scheffenegger Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D33035
# f7220c48	05-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: move ECN handling code to a common file Reduce the burden to maintain correct and extensible ECN related code across multiple stacks and codepaths. Formally no functional change. Incidentially this establishes correct ECN operation in one instance. Reviewed By: rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34162
# 7994ef3c	04-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	Revert "tcp: move ECN handling code to a common file" This reverts commit 0c424c90eaa6602e07bca7836b1d178b91f2a88a.
# 0c424c90	04-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: move ECN handling code to a common file Reduce the burden to maintain correct and extensible ECN related code across multiple stacks and codepaths. Formally no functional change. Incidentially this establishes correct ECN operation in one instance. Reviewed By: rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34162
# fd723975	03-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: fix typo in commit f026275e26d0071ac3dee98526e8b9bcad58f0fa missed one bitmask inversion while committing D34148 Differential Revision: https://reviews.freebsd.org/D34148 Differential Revision: https://reviews.freebsd.org/D34160
# f026275e	03-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: set IP ECN header codepoint properly TCP RACK can cache the IP header while preparing a new TCP packet for transmission. Thus all the IP ECN codepoint bits need to be assigned, without assuming a clear field beforehand. Reviewed By: tuexen, kbowling, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34148
# 1ebf4607	03-Feb-2022	Richard Scheffenegger <rscheff@FreeBSD.org>	tcp: Access all 12 TCP header flags via inline function In order to consistently provide access to all (including reserved) TCP header flag bits, use an accessor function tcp_get_flags and tcp_set_flags. Also expand any flag variable from uint8_t / char to uint16_t. Reviewed By: hselasky, tuexen, glebius, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34130
# d51c8035	02-Feb-2022	Michael Tuexen <tuexen@FreeBSD.org>	rack: fix compilation and small cleanup Fix a function prototype missed in the last commit and whitespace change. Sponsored by: Netflix, Inc.
# 3b3c08c1	02-Feb-2022	Michael Tuexen <tuexen@FreeBSD.org>	tcp: cleanup functions related to socket option handling Consistently only pass the inp and the sopt around. Don't pass the so around, since in a upcoming commit tcp_ctloutput_set() will be called from a context different from setsockopt(). Also expect the inp to be locked when calling tcp_ctloutput_[gs]et(), this is also required for the upcoming use by tcpsso, a command line tool to set socket options. Reviewed by: glebius, rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D34151
# 9e58cca3	26-Jan-2022	Gordon Bergling <gbe@FreeBSD.org>	extra_tcp_stacks: Fix two typos in source code comments - s/differnt/different/ MFC after; 3 days
# b3df222e	26-Jan-2022	Gordon Bergling <gbe@FreeBSD.org>	extra_tcp_stacks: Fix a few common typos TCP_BBR: - Fix a typo introducted in 1b90dfa5d2b0, which was reported by tuexen@ TCP_RACK: - Correct two sysctl descriptions: s/corret/correct/ tcp_bbr(4): Also fix s/measurment/measurement/ in the man page MFC after: 1 week
# aac52f94	18-Jan-2022	Randall Stewart <rrs@FreeBSD.org>	tcp: Warning cleanup from new compiler. The clang compiler recently got an update that generates warnings of unused variables where they were set, and then never used. This revision goes through the tcp stack and cleans all of those up. Reviewed by: Michael Tuexen, Gleb Smirnoff Sponsored by: Netflix Inc. Differential Revision:
# a370832b	26-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: remove delayed drop KPI No longer needed after tcp_output() can ask caller to drop. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33371
# f64dc2ab	26-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: TCP output method can request tcp_drop The advanced TCP stacks (bbr, rack) may decide to drop a TCP connection when they do output on it. The default stack never does this, thus existing framework expects tcp_output() always to return locked and valid tcpcb. Provide KPI extension to satisfy demands of advanced stacks. If the output method returns negative error code, it means that caller must call tcp_drop(). In tcp_var() provide three inline methods to call tcp_output(): - tcp_output() is a drop-in replacement for the default stack, so that default stack can continue using it internally without modifications. For advanced stacks it would perform tcp_drop() and unlock and report that with negative error code. - tcp_output_unlock() handles the negative code and always converts it to positive and always unlocks. - tcp_output_nodrop() just calls the method and leaves the responsibility to drop on the caller. Sweep over the advanced stacks and use new KPI instead of using HPTS delayed drop queue for that. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33370
# dbbcc777	26-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	rack: rack_do_compressed_ack_processing() can call tcp_drop() Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33369
# 66aeb0b5	26-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	rack: drop connection synchronously, when we can For all functions that are leaves of tcp_input() call ctf_do_dropwithreset_conn() instead of ctf_do_dropwithreset(), cause we always got tp and we want it to be dropped. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33368
# 40fa3e40	26-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: mechanically substitute call to tfb_tcp_output to new method. Made with sed(1) execution: sed -Ef sed -i "" $(grep --exclude tcp_var.h -lr tcp_output sys/) sed: s/tp->t_fb->tfb_tcp_output$tp$/tcp_output(tp)/ s/to tfb_tcp_output/to tcp_output()/ Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33366
# 9b602965	15-Dec-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Rack in a rare case we can get stuck sending a very small amount. If a tlp sending new data fails, and then the peer starts talking to us again, we can be in a situation where the tlp_new_data count is set, we are not in recovery and we always send one packet every RTT. The failure has to occur when we send the TLP initially from the ip_output() which is rare. But if it occurs you are basically stuck. This fixes it so we use the new_data count and clear it so we know it will be cleared. If a failure occurs the tlp timer will regenerate a new amount anyway so it is un-needed to carry the value on. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D33325
# dadbc042	06-Dec-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: rack fails to send out a TLP after a MTU change When rack sends out a TLP it sets up various state to make sure it avoids the cwnd (its been more than 1 RTT since our last send) and it may at times send new data. If an MTU change as occurred and our cwnd has collapsed we can have a situation where must_retran flag is set and we obey the cwnd thus never sending the TLP and then sitting stuck. This one line fix addresses that problem Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D33231
# db0ac6de	02-Dec-2021	Cy Schubert <cy@FreeBSD.org>	Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816" This reverts commit 266f97b5e9a7958e365e78288616a459b40d924a, reversing changes made to a10253cffea84c0c980a36ba6776b00ed96c3e3b. A mismerge of a merge to catch up to main resulted in files being committed which should not have been.
# f971e791	02-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_hpts: rename input queue to drop queue and trim dead code The HPTS input queue is in reality used only for "delayed drops". When a TCP stack decides to drop a connection on the output path it can't do that due to locking protocol between main tcp_output() and stacks. So, rack/bbr utilize HPTS to drop the connection in a different context. In the past the queue could also process input packets in context of HPTS thread, but now no stack uses this, so remove this functionality. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33025
# 50f081ec	02-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	tcp_hpts: provide tcp_in_hpts(). It will hide some internal HPTS knowledge from the consumers. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33023
# 1dadeab3	30-Nov-2021	Gordon Bergling <gbe@FreeBSD.org>	netinet: Fix a common typo in source code comments - s/segement/segment/ MFC after: 3 days
# 97e28f0f	17-Nov-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Rack ack war with a mis-behaving firewall or nat with resets. Previously we added ack-war prevention for misbehaving firewalls. This is where the f/w or nat messes up its sequence numbers and causes an ack-war. There is yet another type of ack war that we have found in the wild that is like unto this. Basically the f/w or nat gets a ack (keep-alive probe or such) and instead of turning the ack/seq around and adding a TH_RST it does something real stupid and sends a new packet with seq=0. This of course triggers the challenge ack in the reset processing which then sends in a challenge ack (if the seq=0 is within the range of possible sequence numbers allowed by the challenge) and then we rinse-repeat. This will add the needed tweaks (similar to the last ack-war prevention using the same sysctls and counters) to prevent it and allow say 5 per second by default. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32938
# 26cbd002	11-Nov-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Rack may still calculate long RTT on persists probes. When a persists probe is lost, we will end up calculating a long RTT based on the initial probe and when the response comes from the second probe (or third etc). This means we have a minimum of a confidence level of 3 on a incorrect probe. This commit will change it so that we have one of two options a) Just not count RTT of probes where we had a loss <or> b) Count them still but degrade the confidence to 0. I have set in this the default being to just not measure them, but I am open to having the default be otherwise. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32897
# 477aeb3d	08-Nov-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Printf should be removed. There is a printf when a socket option down to the CC module fails, this really should not be a printf. In fact this whole option needs to be re-thought in coordination with some other changes in the CC modules (its just not right but its ok what it does here if it fails since it will just use the ECN beta). Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32894
# c28e39c3	03-Nov-2021	Gordon Bergling <gbe@FreeBSD.org>	Fix a common typo in syctl descriptions - s/maxiumum/maximum/ MFC after: 3 days
# 141a53cd	29-Oct-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Rack might retransmit forever. If we get a Sacked peer with an MTU change we can retransmit forever if the last bytes are sacked and the client goes away (think power off). Then we never see the end condition and continually retransmit. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32671
# aeda8527	29-Oct-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Rack at times can miscalculate the RTT from what it thinks is a persists probe respone. Turns out that if a peer sends in a window update right after rack fires off a persists probe, we can mis-interpret the window update and calculate a bogus RTT (very short). We still process the window update and send the data but we incorrectly generate an RTT. We should be only doing the RTT stuff if the rwnd is still small and has not changed. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32717
# 5d3bf5b1	25-Oct-2021	Gleb Smirnoff <glebius@FreeBSD.org>	rack: Update the fast send block on setsockopt(2) Rack caches TCP/IP header for fast send, so it doesn't call tcpip_fillheaders(). After certain socket option changes, namely IPV6_TCLASS, IP_TOS and IP_TTL it needs to update its fast block to be in sync with the inpcb. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
# f581a26e	25-Oct-2021	Gleb Smirnoff <glebius@FreeBSD.org>	Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP. Pass control for IP/IP6 level options from generic tcp_ctloutput_set() down to per-stack ctloutput. Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput(). Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
# 12752978	26-Oct-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: The rack stack can incorrectly have an overflow when calculating a burst delay. If the congestion window is very large the fact that we multiply it by 1000 (for microseconds) can cause the uint32_t to overflow and we incorrectly calculate a very small divisor. This will then cause the burst timer to be very large when it should be 0. Instead lets make the three variables uint64_t and avoid the issue. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32668
# 4e4c84f8	22-Oct-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Add hystart-plus to cc_newreno and rack. TCP Hystart draft version -03: https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-hystartplusplus Is a new version of hystart that allows one to carefully exit slow start if the RTT spikes too much. The newer version has a slower-slow-start so to speak that then kicks in for five round trips. To see if you exited too early, if not into congestion avoidance. This commit will add that feature to our newreno CC and add the needed bits in rack to be able to enable it. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32373
# a36230f7	01-Oct-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Make dsack stats available in netstat and also make sure its aware of TLP's. DSACK accounting has been for quite some time under a NETFLIX_STATS ifdef. Statistics on DSACKs however are very useful in figuring out how much bad retransmissions you are doing. This is further complicated, however, by stacks that do TLP. A TLP when discovering a lost ack in the reverse path will cause the generation of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well as the more traditional dsack-bytes and dsack-packets. These will now all display in netstat -p tcp -s. This also updates all stacks that are currently built to keep track of these stats. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32158
# 1ca931a5	23-Sep-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Rack compressed ack path updates the recv window too easily The compressed ack path of rack is not following proper procedures in updating the peers window. It should be checking the seq and ack values before updating and instead it is blindly updating the values. This could in theory get the wrong window in the connection for some length of time. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32082
# fd69939e	23-Sep-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Two bugs in rack one of which can lead to a panic. In extensive testing in NF we have found two issues inside the rack stack. 1) An incorrect offset is being generated by the fast send path when a fast send is initiated on the end of the socket buffer and before the fast send runs, the sb_compress macro adds data to the trailing socket. This fools the fast send code into thinking the sb offset changed and it miscalculates a "updated offset". It should only do that when the mbuf in question got smaller.. i.e. an ack was processed. This can lead to a panic deref'ing a NULL mbuf if that packet is ever retransmitted. At the best case it leads to invalid data being sent to the client which usually terminates the connection. The fix is to have the proper logic (that is in the rsm fast path) to make sure we only update the offset when the mbuf shrinks. 2) The other issue is more bothersome. The timestamp check in rack needs to use the msec timestamp when comparing the timestamp echo to now. It was using a microsecond timestamp which ends up giving error prone results but causes only small harm in trying to identify which send to use in RTT calculations if its a retransmit. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32062
# 5baf32c9	17-Aug-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Add support for DSACK based reordering window to rack. The rack stack, with respect to the rack bits in it, was originally built based on an early I-D of rack. In fact at that time the TLP bits were in a separate I-D. The dynamic reordering window based on DSACK events was not present in rack at that time. It is now part of the RFC and we need to update our stack to include these features. However we want to have a way to control the feature so that we can, if the admin decides, make it stay the same way system wide as well as via socket option. The new sysctl and socket option has the following meaning for setting: 00 (0) - Keep the old way, i.e. reordering window is 1 and do not use DSACK bytes to add to reorder window 01 (1) - Change the Reordering window to 1/4 of an RTT but do not use DSACK bytes to add to reorder window 10 (2) - Keep the reordering window as 1, but do use SACK bytes to add additional 1/4 RTT delay to the reorder window 11 (3) - reordering window is 1/4 of an RTT and add additional DSACK bytes to increase the reordering window (RFC behavior) The default currently in the sysctl is 3 so we get standards based behavior. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31506
# b1e806c0	07-Jul-2021	Andrew Gallatin <gallatin@FreeBSD.org>	tcp: fix alternate stack build with LINT-NO{INET,INET6,IP} When fixing another bug, I noticed that the alternate TCP stacks do not build when various combinations of ipv4 and ipv6 are disabled. Reviewed by: rrs, tuexen Differential Revision: https://reviews.freebsd.org/D31094 Sponsored by: Netflix
# d7955cc0	06-Jul-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: HPTS performance enhancements HPTS drives both rack and bbr, and yet there have been many complaints about performance. This bit of work restructures hpts to help reduce CPU overhead. It does this by now instead of relying on the timer/callout to drive it instead use user return from a system call as well as lro flushes to drive hpts. The timer becomes a backstop that dynamically adjusts based on how "late" we are. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31083
# e834f9a4	06-Jul-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Address goodput and TLP edge cases. There are several cases where we make a goodput measurement and we are running out of data when we decide to make the measurement. In reality we should not make such a measurement if there is no chance we can have "enough" data. There is also some corner case TLP's that end up not registering as a TLP like they should, we fix this by pushing the doing_tlp setup to the actual timeout that knows it did a TLP. This makes it so we always have the appropriate flag on the sendmap indicating a TLP being done as well as count correctly so we make no more that two TLP's. In addressing the goodput lets also add a "quality" metric that can be viewed via blackbox logs so that a casual observer does not have to figure out how good of a measurement it is. This is needed due to the fact that we may still make a measurement that is of a poorer quality as we run out of data but still have a minimal amount of data to make a measurement. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31076
# 9e4d9e4c	25-Jun-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Preparation for allowing hardware TLS to be able to kick a tcp connection that is retransmitting too much out of hardware and back to software. Hardware TLS is now supported in some interface cards and it works well. Except that when we have connections that retransmit a lot we get into trouble with all the retransmits. This prep step makes way for change that Drew will be making so that we can "kick out" a session from hardware TLS. Reviewed by: mtuexen, gallatin Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30895
# 66aec14a	24-Jun-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Rack not being very friendly with V6:4 socket and having a connection from V4 There were two bugs that prevented V4 sockets from connecting to a rack server running a V4/V6 socket. As well as a bug that stops the mapped v4 in V6 address from working. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30885
# f1536bb5	11-Jun-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: remove debug output from RACK Reported by: iron.udjin@gmail.com, Marek Zarychta Reviewed by: rrs PR: 256538 MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D30723 Sponsored by: Netflix, Inc.
# ba1b3e48	11-Jun-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Missing mfree in rack and bbr Recently (Nov) we added logic that protects against a peer negotiating a timestamp, and then not including a timestamp. This involved in the input path doing a goto done_with_input label. Now I suspect the code was cribbed from one in Rack that has to do with the SYN. This had a bug, i.e. it should have a m_freem(m) before going to the label (bbr had this missing m_freem() but rack did not). This then caused the missing m_freem to show up in both BBR and Rack. Also looking at the code referencing m->m_pkthdr.lro_nsegs later (after processing) is not a good idea, even though its only for logging. Best to copy that off before any frees can take place. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30727
# 224cf7b3	11-Jun-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: fix compilation of IPv4-only builds PR: 256538 Reported by: iron.udjin@gmail.com MFC after: 3 days Sponsored by: Netflix, Inc.
# 67e89281	10-Jun-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Mbuf leak while holding a socket buffer lock. When running at NF the current Rack and BBR changes with the recent commits from Richard that cause the socket buffer lock to be held over the ip_output() call and then finally culminating in a call to tcp_handle_wakeup() we get a lot of leaked mbufs. I don't think that this leak is actually caused by holding the lock or what Richard has done, but is exposing some other bug that has probably been lying dormant for a long time. I will continue to look (using his changes) at what is going on to try to root cause out the issue. In the meantime I can't leave the leaks out for everyone else. So this commit will revert all of Richards changes and move both Rack and BBR back to just doing the old sorwakeup_locked() calls after messing with the so_rcv buffer. We may want to look at adding back in Richards changes after I have pinpointed the root cause of the mbuf leak and fixed it. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30704
# 4747500d	04-Jun-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: A better fix for the previously attempted fix of the ack-war issue with tcp. So it turns out that my fix before was not correct. It ended with us failing some of the "improved" SYN tests, since we are not in the correct states. With more digging I have figured out the root of the problem is that when we receive a SYN\|FIN the reassembly code made it so we create a segq entry to hold the FIN. In the established state where we were not in order this would be correct i.e. a 0 len with a FIN would need to be accepted. But if you are in a front state we need to strip the FIN so we correctly handle the ACK but ignore the FIN. This gets us into the proper states and avoids the previous ack war. I back out some of the previous changes but then add a new change here in tcp_reass() that fixes the root cause of the issue. We still leave the rack panic fixes in place however. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30627
# 8c69d988	27-May-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: When we have an out-of-order FIN we do want to strip off the FIN bit. The last set of commits fixed both a panic (in rack) and an ACK-war (in freebsd and bbr). However there was a missing case, i.e. where we get an out-of-order FIN by itself. In such a case we don't want to leave the FIN bit set, otherwise we will do the wrong thing and ack the FIN incorrectly. Instead we need to go through the tcp_reasm() code and that way the FIN will be stripped and all will be well. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30497
# 4f3addd9	26-May-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Add a socket option to rack so we can test various changes to the slop value in timers. Timer_slop, in TCP, has been 200ms for a long time. This value dates back a long time when delayed ack timers were longer and links were slower. A 200ms timer slop allows 1 MSS to be sent over a 60kbps link. Its possible that lowering this value to something more in line with todays delayed ack values (40ms) might improve TCP. This bit of code makes it so rack can, via a socket option, adjust the timer slop. Reviewed by: mtuexen Sponsered by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30249
# 13c0e198	25-May-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Fix bugs related to the PUSH bit and rack and an ack war Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed in the last commit. The problem is the left edge gets transmitted before the adjustments are done to the send_map, this means that right edge bits must be considered to be added only if the entire RSM is being retransmitted. Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself. After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically what happens is we go into the reassembly code and lose the FIN bit. The trick here is we should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything. That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no timers running. This is because the usrclosed function gets called and the FIN's and such have already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2 timer can speed this up in testing. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30451
# 9bbd1a8f	24-May-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: fix a RACK socket buffer lock issue Fix a missing socket buffer unlocking of the socket receive buffer. Reviewed by: gallatin, rrs MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D30402
# 631449d5	24-May-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Fix an issue with the PUSH bit as well as fill in the missing mtu change for fsb's The push bit itself was also not actually being properly moved to the right edge. The FIN bit was incorrectly on the left edge. We fix these two issues as well as plumb in the mtu_change for alternate stacks. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30413
# 8923ce63	22-May-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: Handle stack switch while processing socket options Handle the case where during socket option processing, the user switches a stack such that processing the stack specific socket option does not make sense anymore. Return an error in this case. MFC after: 1 week Reviewed by: markj Reported by: syzbot+a6e1d91f240ad5d72cd1@syzkaller.appspotmail.com Sponsored by: Netflix, Inc. Differential revision: https://reviews.freebsd.org/D30395
# 39756885	21-May-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	rack: honor prior socket buffer lock when doing the upcall While partially reverting D24237 with D29690, due to introducing some unintended effects for in-kernel TCP consumers, the preexisting lock on the socket send buffer was not considered properly. Found by: markj MFC after: 2 weeks Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D30390
# 032bf749	21-May-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	[tcp] Keep socket buffer locked until upcall r367492 would unlock the socket buffer before eventually calling the upcall. This leads to problematic interaction with NFS kernel server/client components (MP threads) accessing the socket buffer with potentially not correctly updated state. Reported by: rmacklem Reviewed By: tuexen, #transport Tested by: rmacklem, otis MFC after: 2 weeks Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29690
# 500eb6dd	21-May-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: Fix sending of TCP segments with IP level options When bringing in TCP over UDP support in https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605, the length of IP level options was considered when locating the transport header. This was incorrect and is fixed by this patch. X-MFC with: https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605 MFC after: 3 days Reviewed by: markj, rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D30358
# 02cffbc2	13-May-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: Incorrect KASSERT causes a panic in rack Skyzall found an interesting panic in rack. When a SYN and FIN are both sent together a KASSERT gets tripped where it is validating that a mbuf pointer is in the sendmap. But a SYN and FIN often will not have a mbuf pointer. So the fix is two fold a) make sure that the SYN and FIN split the right way when cloning an RSM SYN on left edge and FIN on right. And also make sure the KASSERT properly accounts for the case that we have a SYN or FIN so we don't panic. Reviewed by: mtuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D30241
# 251842c6	12-May-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp rack: improve initialisation of retransmit timeout When the TCP is in the front states, don't take the slop variable into account. This improves consistency with the base stack. Reviewed by: rrs@ Differential Revision: https://reviews.freebsd.org/D30230 MFC after: 1 week Sponsored by: Netflix, Inc.
# 4b86a24a	11-May-2021	Randall Stewart <rrs@FreeBSD.org>	tcp: In rack, we must only convert restored rtt when the hostcache does restore them. Rack now after the previous commit is very careful to translate any value in the hostcache for srtt/rttvar into its proper format. However there is a snafu here in that if tp->srtt is 0 is the only time that the HC will actually restore the srtt. We need to then only convert the srtt restored when it is actually restored. We do this by making sure it was zero before the call to cc_conn_init and it is non-zero afterwards. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30213
# 9867224b	10-May-2021	Randall Stewart <rrs@FreeBSD.org>	tcp:Host cache and rack ending up with incorrect values. The hostcache up to now as been updated in the discard callback but without checking if we are all done (the race where there are more than one calls and the counter has not yet reached zero). This means that when the race occurs, we end up calling the hc_upate more than once. Also alternate stacks can keep there srtt/rttvar in different formats (example rack keeps its values in microseconds). Since we call the hc_update before the stack fini() then the values will be in the wrong format. Rack on the other hand, needs to convert items pulled from the hostcache into its internal format else it may end up with very much incorrect values from the hostcache. In the process lets commonize the update mechanism for srtt/rttvar since we now have more than one place that needs to call it. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30172
# a16cee02	07-May-2021	Randall Stewart <rrs@FreeBSD.org>	Fix a UDP tunneling issue with rack. Basically there are two issues. A) Not enough hdrlen was being calculated when a UDP tunnel is in place. and B) Not enough memory is allocated in racks fsb. We need to overbook the fsb to include a udphdr just in case. Submitted by: Peter Lei Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30157
# 5d8fd932	06-May-2021	Randall Stewart <rrs@FreeBSD.org>	This brings into sync FreeBSD with the netflix versions of rack and bbr. This fixes several breakages (panics) since the tcp_lro code was committed that have been reported. Quite a few new features are now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the largest). There is also support for ack-war prevention. Documents comming soon on rack.. Sponsored by: Netflix Reviewed by: rscheff, mtuexen Differential Revision: https://reviews.freebsd.org/D30036
# 9e644c23	18-Apr-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: add support for TCP over UDP Adding support for TCP over UDP allows communication with TCP stacks which can be implemented in userspace without requiring special priviledges or specific support by the OS. This is joint work with rrs. Reviewed by: rrs Sponsored by: Netflix, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29469
# 2e978260	17-Apr-2021	Richard Scheffenegger <rscheff@FreeBSD.org>	rack: Fix ECN on finalizing session. Maintain code similarity between RACK and base stack for ECN. This may not strictly be necessary, depending when a state transition to FIN_WAIT_1 is done in RACK after a shutdown() or close() syscall. MFC after: 3 days Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29658
# 40f41ece	22-Mar-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: improve handling of SYN segments in SYN-SENT state Ensure that the stack does not generate a DSACK block for user data received on a SYN segment in SYN-SENT state. Reviewed by: rscheff MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D29376 Sponsored by: Netflix, Inc.
# 5666643a	13-Mar-2021	Gordon Bergling <gbe@FreeBSD.org>	Fix some common typos in comments - occured -> occurred - normaly -> normally - controling -> controlling - fileds -> fields - insterted -> inserted - outputing -> outputting MFC after: 1 week
# 705d06b2	05-Mar-2021	Michael Tuexen <tuexen@FreeBSD.org>	rack: unbreak TCP fast open for the client side Allow sending user data on the SYN segment. Reviewed by: rrs MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D29082 Sponsored by: Netflix, Inc.
# 99adf230	01-Mar-2021	Michael Tuexen <tuexen@FreeBSD.org>	RACK: fix an issue triggered by using the CDG CC module Obtained from: rrs@ MFC after: 3 days PR: 238741 Sponsored by: Netlix, Inc.
# 1a714ff2	26-Jan-2021	Randall Stewart <rrs@FreeBSD.org>	This pulls over all the changes that are in the netflix tree that fix the ratelimit code. There were several bugs in tcp_ratelimit itself and we needed further work to support the multiple tag format coming for the joint TLS and Ratelimit dances. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D28357
# d2b3cedd	13-Jan-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: add sysctl to tolerate TCP segments missing timestamps When timestamp support has been negotiated, TCP segements received without a timestamp should be discarded. However, there are broken TCP implementations (for example, stacks used by Omniswitch 63xx and 64xx models), which send TCP segments without timestamps although they negotiated timestamp support. This patch adds a sysctl variable which tolerates such TCP segments and allows to interoperate with broken stacks. Reviewed by: jtl@, rscheff@ Differential Revision: https://reviews.freebsd.org/D28142 Sponsored by: Netflix, Inc. PR: 252449 MFC after: 1 week
# cc3c3485	13-Jan-2021	Michael Tuexen <tuexen@FreeBSD.org>	tcp: fix handling of TCP RST segments missing timestamps A TCP RST segment should be processed even it is missing TCP timestamps. Reported by: dmgk@, kevans@ Reviewed by: rscheff@, dmgk@ Sponsored by: Netflix, Inc. MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28143
# 283c76c7	09-Nov-2020	Michael Tuexen <tuexen@FreeBSD.org>	RFC 7323 specifies that: * TCP segments without timestamps should be dropped when support for the timestamp option has been negotiated. * TCP segments with timestamps should be processed normally if support for the timestamp option has not been negotiated. This patch enforces the above. PR: 250499 Reviewed by: gnn, rrs MFC after: 1 week Sponsored by: Netflix, Inc Differential Revision: https://reviews.freebsd.org/D27148
# 4d0770f1	08-Nov-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Prevent premature SACK block transmission during loss recovery Under specific conditions, a window update can be sent with outdated SACK information. Some clients react to this by subsequently delaying loss recovery, making TCP perform very poorly. Reported by: chengc_netapp.com Reviewed by: rrs, jtl MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D24237
# 285385ba	09-Sep-2020	Randall Stewart <rrs@FreeBSD.org>	So it turns out that syzkaller hit another crash. It has to do with switching stacks with a SENT_FIN outstanding. Both rack and bbr will only send a FIN if all data is ack'd so this must be enforced. Also if the previous stack sent the FIN we need to make sure in rack that when we manufacture the "unknown" sends that we include the proper HAS_FIN bits. Note for BBR we take a simpler approach and just refuse to switch. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D26269
# 662c1305	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	net: clean up empty lines in .c and .h files
# 1951fa79	25-Aug-2020	Michael Tuexen <tuexen@FreeBSD.org>	RFC 3465 defines a limit L used in TCP slow start for limiting the number of acked bytes as described in Section 2.2 of that document. This patch ensures that this limit is not also applied in congestion avoidance. Applying this limit also in congestion avoidance can result in using less bandwidth than allowed. Reported by: l.tian.email@gmail.com Reviewed by: rrs, rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D26120
# b9978183	19-Aug-2020	Andrew Gallatin <gallatin@FreeBSD.org>	TCP: remove special treatment for hardware (ifnet) TLS Remove most special treatment for ifnet TLS in the TCP stack, except for code to avoid mixing handshakes and bulk data. This code made heroic efforts to send down entire TLS records to NICs. It was added to improve the PCIe bus efficiency of older TLS offload NICs which did not keep state per-session, and so would need to re-DMA the first part(s) of a TLS record if a TLS record was sent in multiple TCP packets or TSOs. Newer TLS offload NICs do not need this feature. At Netflix, we've run extensive QoE tests which show that this feature reduces client quality metrics, presumably because the effort to send TLS records atomically causes the server to both wait too long to send data (leading to buffers running dry), and to send too much data at once (leading to packet loss). Reviewed by: hselasky, jhb, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26103
# 95ef69c6	16-Jun-2020	Randall Stewart <rrs@FreeBSD.org>	iSo in doing final checks on OCA firmware with all the latest tweaks the dup-ack checking packet drill script was failing with a number of unexpected acks. So it turns out if you have the default recvwin set up to 1Meg (like OCA's do) and you have no window scaling (like the dupack checking code) then we have another case where we are always trying to update the rwnd and sending an ack when we should not. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D25298
# 4d418f8d	15-Jun-2020	Randall Stewart <rrs@FreeBSD.org>	So it turns out rack has a shortcoming in dup-ack counting. It counts the dupacks but then does not properly respond to them. This is because a few missing bits are not present. BBR actually does properly respond (though it also sends a TLP which is interesting and maybe something to fix).. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D25294
# f092a3c7	12-Jun-2020	Randall Stewart <rrs@FreeBSD.org>	So it turns out with the right window scaling you can get the code in all stacks to always want to do a window update, even when no data can be sent. Now in cases where you are not pacing thats probably ok, you just send an extra window update or two. However with bbr (and rack if its paced) every time the pacer goes off its going to send a "window update". Also in testing bbr I have found that if we are not responding to data right away we end up staying in startup but incorrectly holding a pacing gain of 192 (a loss). This is because the idle window code does not restict itself to only work with PROBE_BW. In all other states you dont want it doing a PROBE_BW state change. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D25247
# e854dd38	08-Jun-2020	Randall Stewart <rrs@FreeBSD.org>	An important statistic in determining if a server process (or client) is being delayed is to know the time to first byte in and time to first byte out. Currently we have no way to know these all we have is t_starttime. That (t_starttime) tells us what time the 3 way handshake completed. We don't know when the first request came in or how quickly we responded. Nor from a client perspective do we know how long from when we sent out the first byte before the server responded. This small change adds the ability to track the TTFB's. This will show up in BB logging which then can be pulled for later analysis. Note that currently the tracking is via the ticks variable of all three variables. This provides a very rough estimate (hz=1000 its 1ms). A follow-on set of work will be to change all three of these values into something with a much finer resolution (either microseconds or nanoseconds), though we may want to make the resolution configurable so that on lower powered machines we could still use the much cheaper ticks variable. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D24902
# f1ea4e41	03-Jun-2020	Randall Stewart <rrs@FreeBSD.org>	This fixes a couple of skyzaller crashes. Most of them have to do with TFO. Even the default stack had one of the issues: 1) We need to make sure for rack that we don't advance snd_nxt beyond iss when we are not doing fast open. We otherwise can get a bunch of SYN's sent out incorrectly with the seq number advancing. 2) When we complete the 3-way handshake we should not ever append to reassembly if the tlen is 0, if TFO is enabled prior to this fix we could still call the reasemmbly. Note this effects all three stacks. 3) Rack like its cousin BBR should track if a SYN is on a send map entry. 4) Both bbr and rack need to only consider len incremented on a SYN if the starting seq is iss, otherwise we don't increment len which may mean we return without adding a sendmap entry. This work was done in collaberation with Michael Tuexen, thanks for all the testing! Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D25000
# af2fb894	21-May-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	With RFC3168 ECN, CWR SHOULD only be sent with new data Overly conservative data receivers may ignore the CWR flag on other packets, and keep ECE latched. This can result in continous reduction of the congestion window, and very poor performance when ECN is enabled. Reviewed by: rgrimes (mentor), rrs Approved by: rgrimes (mentor), tuexen (mentor) MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D23364
# 8e051165	21-May-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Retain only mutually supported TCP options after simultaneous SYN When receiving a parallel SYN in SYN-SENT state, remove all the options only we supported locally before sending the SYN,ACK. This addresses a consistency issue on parallel opens. Also, on such a parallel open, the stack could be coaxed into running with timestamps enabled, even if administratively disabled. Reviewed by: tuexen (mentor) Approved by: tuexen (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D23371
# 6e16d877	21-May-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Handle ECN handshake in simultaneous open While testing simultaneous open TCP with ECN, found that negotiation fails to arrive at the expected final state. Reviewed by: tuexen (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D23373
# 777b88d6	15-May-2020	Randall Stewart <rrs@FreeBSD.org>	This fixes several skyzaller issues found with the help of Michael Tuexen. There was some accounting errors with TCPFO for bbr and also for both rack and bbr there was a FO case where we should be jumping to the just_return_nolock label to exit instead of returning 0. This of course caused no timer to be running and thus the stuck sessions. Reported by: Michael Tuexen and Skyzaller Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D24852
# b1ddcbc6	07-May-2020	Randall Stewart <rrs@FreeBSD.org>	When in the SYN-SENT state bbr and rack will not properly send an ACK but instead start the D-ACK timer. This causes so_reuseport_lb_test to fail since it slows down how quickly the program runs until the timeout occurs and fails the test Sponsored by: Netflix inc. Differential Revision: https://reviews.freebsd.org/D24747
# 8717b8f1	07-May-2020	Randall Stewart <rrs@FreeBSD.org>	NF has an internal option that changes the tcp_mcopy_m routine slightly (has a few extra arguments). Recently that changed to only have one arg extra so that two ifdefs around the call are no longer needed. Lets take out the extra ifdef and arg. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D24736
# 51a53922	04-May-2020	Michael Tuexen <tuexen@FreeBSD.org>	Add net epoch support back, which was taken out by accident in https://svnweb.freebsd.org/changeset/base/360639 Reviewed by: rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24694
# 963fb2ad	04-May-2020	Randall Stewart <rrs@FreeBSD.org>	This commit brings things into sync with the advancements that have been made in rack and adds a few fixes in BBR. This also removes any possibility of incorrectly doing OOB data the stacks do not support it. Should fix the skyzaller crashes seen in the past. Still to fix is the BBR issue just reported this weekend with the SYN and on sending a RST. Note that this version of rack can now do pacing as well. Sponsored by:Netflix Inc Differential Revision:https://reviews.freebsd.org/D24576
# 9028b6e0	29-Apr-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Prevent premature shrinking of the scaled receive window which can cause a TCP client to use invalid or stale TCP sequence numbers for ACK packets. Packets with old sequence numbers are ignored and not used to update the send window size. This might cause the TCP session to hang indefinitely under some circumstances. Reported by: Cui Cheng Reviewed by: tuexen (mentor), rgrimes (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 3 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D24515
# b2ade6b1	29-Apr-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Correctly set up the initial TCP congestion window in all cases, by not including the SYN bit sequence space in cwnd related calculations. Snd_und is adjusted explicitly in all cases, outside the cwnd update, instead. This fixes an off-by-one conformance issue with regular TCP sessions not using Appropriate Byte Counting (RFC3465), sending one more packet during the initial window than expected. PR: 235256 Reviewed by: tuexen (mentor), rgrimes (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 3 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D19000
# 454d3896	25-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix LINT build #2 after r360292. Pointyhat to: melifaro
# bb410f9f	21-Apr-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	revert rS360143 - Correctly set up initial cwnd due to syzkaller panics found Reported by: tuexen Approved by: tuexen (mentor) Sponsored by: NetApp, Inc.
# 73b76966	21-Apr-2020	Richard Scheffenegger <rscheff@FreeBSD.org>	Correctly set up the initial TCP congestion window in all cases, by adjust snd_una right after the connection initialization, to include the one byte in sequence space occupied by the SYN bit. This does not change the regular ACK processing, while making the BYTES_THIS_ACK macro to work properly. PR: 235256 Reviewed by: tuexen (mentor), rgrimes (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D19000
# 413c3db1	31-Mar-2020	Michael Tuexen <tuexen@FreeBSD.org>	Allow the TCP backhole detection to be disabled at all, enabled only for IPv4, enabled only for IPv6, and enabled for IPv4 and IPv6. The current blackhole detection might classify a temporary outage as an MTU issue and reduces permanently the MSS. Since the consequences of such a reduction due to a misclassification are much more drastically for IPv4 than for IPv6, allow the administrator to enable it for IPv6 only. Reviewed by: bcr@ (man page), Richard Scheffenegger Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24219
# 7ca6e296	12-Mar-2020	Michael Tuexen <tuexen@FreeBSD.org>	Use KMOD_TCPSTAT_INC instead of TCPSTAT_INC for RACK and BBR, since these are kernel modules. Also add a KMOD_TCPSTAT_ADD and use that instead of TCPSTAT_ADD. Reviewed by: jtl@, rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D23904
# 7029da5c	26-Feb-2020	Pawel Biernacki <kaktus@FreeBSD.org>	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
# 3fba40d9	11-Feb-2020	Randall Stewart <rrs@FreeBSD.org>	Remove all trailing white space from the BBR/Rack fold. Bits left around by emacs (thanks emacs).
# d2517ab0	11-Feb-2020	Randall Stewart <rrs@FreeBSD.org>	Now that all of the stats framework is in FreeBSD the bits that disabled stats when netflix-stats is not defined is no longer needed. Lets remove these bits so that we will properly use stats per its definition in BBR and Rack. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D23088
# 9cc711c9	25-Jan-2020	Michael Tuexen <tuexen@FreeBSD.org>	Sending CWR after an RTO is according to RFC 3168 generally required and not only for the DCTCP congestion control. Submitted by: Richard Scheffenegger Reviewed by: rgrimes, tuexen@, Cheng Cui MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23119
# 47e2c17c	25-Jan-2020	Michael Tuexen <tuexen@FreeBSD.org>	Don't set the ECT codepoint on retransmitted packets during SACK loss recovery. This is required by RFC 3168. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@, Cheng Cui MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23118
# a2d59694	25-Jan-2020	Michael Tuexen <tuexen@FreeBSD.org>	As a TCP client only enable ECN when the corresponding sysctl variable indicates that ECN should be negotiated for the client side. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23228
# ee97681e	24-Jan-2020	Michael Tuexen <tuexen@FreeBSD.org>	Don't delay the ACK for a TCP segment with the CWR flag set. This allows the data sender to increase the CWND faster. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@, Cheng Cui MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22670
# e1d2b469	24-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Enter the network epoch when rack_output() is called in setsockopt(2).
# 109eb549	21-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Make tcp_output() require network epoch. Enter the epoch before calling into tcp_output() from those functions, that didn't do that before. This eliminates a bunch of epoch recursions in TCP.
# b9555453	21-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Make ip6_output() and ip_output() require network epoch. All callers that before may called into these functions without network epoch now must enter it.
# 334fc582	08-Jan-2020	Bjoern A. Zeeb <bz@FreeBSD.org>	vnet: virtualise more network stack sysctls. Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are set in the netoptions startup script, which we would love to run for VNETs as well [1]. While virtualising the log_in_vain sysctls seems pointles at first for as long as the kernel message buffer is not virtualised, it at least allows an administrator to debug the base system or an individual jail if needed without turning the logging on for all jails running on a system. PR: 243193 [1] MFC after: 2 weeks
# 4ad24737	06-Jan-2020	Randall Stewart <rrs@FreeBSD.org>	This catches rack up in the recent changes to ECN and also commonizes the functions that both the freebsd and rack stack uses. Sponsored by:Netflix Inc Differential Revision: https://reviews.freebsd.org/D23052
# 1cf55767	17-Dec-2019	Randall Stewart <rrs@FreeBSD.org>	This commit is a bit of a re-arrange of deck chairs. It gets both rack and bbr ready for the completion of the STATs framework in FreeBSD. For now if you don't have both NF_stats and stats on it disables them. As soon as the rest of the stats framework lands we can remove that restriction and then just uses stats when defined. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D22479
# 3cf38784	01-Dec-2019	Michael Tuexen <tuexen@FreeBSD.org>	Move all ECN related flags from the flags to the flags2 field. This allows adding more ECN related flags in the future. No functional change intended. Submitted by: Richard Scheffenegger Reviewed by: rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22497
# b72e56e7	01-Dec-2019	Michael Tuexen <tuexen@FreeBSD.org>	This is an initial step in implementing the new congestion window validation as specified in RFC 7661. Submitted by: Richard Scheffenegger Reviewed by: rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D21798
# 8df12ffc	01-Dec-2019	Michael Tuexen <tuexen@FreeBSD.org>	Make the IPTOS value available to all substate handlers. This will allow to add support for L4S or SCE, which require processing of the IP TOS field. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22426
# d40c0d47	07-Nov-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Now that all of the tcp_input() and all its branches are executed in the network epoch, we can greatly simplify synchronization. Remove all unneccesary epoch enters hidden under INP_INFO_RLOCK macro. Remove some unneccesary assertions and convert necessary ones into the NET_EPOCH_ASSERT macro.
# 8ee1cf03	14-Oct-2019	Randall Stewart <rrs@FreeBSD.org>	if_hw_tsomaxsegsize needs to be initialized to zero, just like in bbr.c and tcp_output.c
# 12a43d0d	29-Sep-2019	Michael Tuexen <tuexen@FreeBSD.org>	RFC 7112 requires a host to put the complete IP header chain including the TCP header in the first IP packet. Enforce this in tcp_output(). In addition make sure that at least one byte payload fits in the TCP segement to allow making progress. Without this check, a kernel with INVARIANTS will panic. This issue was found by running an instance of syzkaller. Reviewed by: jtl@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21665
# 79c2a2a0	28-Sep-2019	Michael Tuexen <tuexen@FreeBSD.org>	Ensure that the INP lock is released before leaving [gs]etsockopt() for RACK specific socket options. These issues were found by a syzkaller instance. Reviewed by: rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21825
# ac7bd23a	24-Sep-2019	Randall Stewart <rrs@FreeBSD.org>	lets put (void) in a couple of functions to keep older platforms that are stuck with gcc happy (ppc). The changes are needed in both bbr and rack. Obtained from: Michael Tuexen (mtuexen@)
# 35c7bb34	24-Sep-2019	Randall Stewart <rrs@FreeBSD.org>	This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This is a completely separate TCP stack (tcp_bbr.ko) that will be built only if you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that supports hardware pacing, BBR understands how to use such a feature. Note that this commit also adds in a general purpose time-filter which allows you to have a min-filter or max-filter. A filter allows you to have a low (or high) value for some period of time and degrade slowly to another value has time passes. You can find out the details of BBR by looking at the original paper at: https://queue.acm.org/detail.cfm?id=3022184 or consult many other web resources you can find on the web referenced by "BBR congestion control". It should be noted that BBRv1 (which this is) does tend to unfairness in cases of small buffered paths, and it will usually get less bandwidth in the case of large BDP paths(when competing with new-reno or cubic flows). BBR is still an active research area and we do plan on implementing V2 of BBR to see if it is an improvement over V1. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D21582
# dd3121a8	19-Sep-2019	Michael Tuexen <tuexen@FreeBSD.org>	When the RACK stack computes the space for user data in a TCP segment, it wasn't taking the IP level options into account. This patch fixes this. In addition, it also corrects a KASSERT and adds protection code to assure that the IP header chain and the TCP head fit in the first fragment as required by RFC 7112. Reviewed by: rrs@ MFC after: 3 days Sponsored by: Nertflix, Inc. Differential Revision: https://reviews.freebsd.org/D21666
# 6d261981	09-Sep-2019	Michael Tuexen <tuexen@FreeBSD.org>	Only update SACK/DSACK lists when a non-empty segment was received. This fixes hitting a KASSERT with a valid packet exchange. Reviewed by: rrs@, Richard Scheffenegger MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21567
# 191ae5bf	03-Sep-2019	Michael Tuexen <tuexen@FreeBSD.org>	Fix two TCP RACK issues: * Convert the TCP delayed ACK timer from ms to ticks as required. This fixes the timer on platforms with hz != 1000. * Don't delay acknowledgements which report duplicate data using DSACKs. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21512
# fe5dee73	02-Sep-2019	Michael Tuexen <tuexen@FreeBSD.org>	This patch improves the DSACK handling to conform with RFC 2883. The lowest SACK block is used when multiple Blocks would be elegible as DSACK blocks ACK blocks get reordered - while maintaining the ordering of SACK blocks not relevant in the DSACK context is maintained. Reviewed by: rrs@, tuexen@ Obtained from: Richard Scheffenegger MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21038
# b2e60773	26-Aug-2019	John Baldwin <jhb@FreeBSD.org>	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277
# e13ad86c	21-Aug-2019	Randall Stewart <rrs@FreeBSD.org>	Fix an issue when TSO and Rack play together. Basically an retransmission of the initial SYN (with data) would cause us to strip the SYN and decrement/increase offset/len which then caused us a -1 offset and a panic. Reported by: Larry Rosenman (Michael Tuexen helped me debug this at the IETF)
# 23fa2dbc	12-Aug-2019	Randall Stewart <rrs@FreeBSD.org>	Place back in the dependency on HPTS via module depends versus a fatal error in compiling. This was taken out by mistake when I mis-merged from the 18q22p2 sources of rack in NF. Opps. Reported by: sbruno
# e5926fd3	14-Jul-2019	Randall Stewart <rrs@FreeBSD.org>	This is the second in a number of patches needed to get BBRv1 into the tree. This fixes the DSACK bug but is also needed by BBR. We have yet to go two more one will be for the pacing code (tcp_ratelimit.c) and the second will be for the new updated LRO code that allows a transport to know the arrival times of packets and (tcp_lro.c). After that we should finally be able to get BBRv1 into head. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D20908
# 55f79588	12-Jul-2019	Randall Stewart <rrs@FreeBSD.org>	add back the comment around the pending DSACK fixes.
# 1cf999a5	10-Jul-2019	Randall Stewart <rrs@FreeBSD.org>	Update to jhb's other suggestion, use #error when we are missing HPTS.
# 9cf3c235	10-Jul-2019	Randall Stewart <rrs@FreeBSD.org>	Update copyright per JBH's suggestions.. thanks.
# 3b0b41e6	10-Jul-2019	Randall Stewart <rrs@FreeBSD.org>	This commit updates rack to what is basically being used at NF as well as sets in some of the groundwork for committing BBR. The hpts system is updated as well as some other needed utilities for the entrance of BBR. This is actually part 1 of 3 more needed commits which will finally complete with BBRv1 being added as a new tcp stack. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20834
# 5e02b277	19-Jun-2019	Jonathan T. Looney <jtl@FreeBSD.org>	Add the ability to limit how much the code will fragment the RACK send map in response to SACKs. The default behavior is unchanged; however, the limit can be activated by changing the new net.inet.tcp.rack.split_limit sysctl. Submitted by: Peter Lei <peterlei@netflix.com> Reported by: jtl Reviewed by: lstewart (earlier version) Security: CVE-2019-5599
# d1156b05	06-Jun-2019	Michael Tuexen <tuexen@FreeBSD.org>	r347382 added receiver side DSACK support for the TCP base stack. The corresponding changes for the RACK stack where missed and are added by this commit. Reviewed by: Richard Scheffenegger, rrs@ MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D20372
# c6dcb64b	20-Feb-2019	Michael Tuexen <tuexen@FreeBSD.org>	Use exponential backoff for retransmitting SYN segments as specified in the TCP RFCs. Reviewed by: rrs@, Richard Scheffenegger Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18974
# 52467047	04-Feb-2019	Warner Losh <imp@FreeBSD.org>	Regularize the Netflix copyright Use recent best practices for Copyright form at the top of the license: 1. Remove all the All Rights Reserved clauses on our stuff. Where we piggybacked others, use a separate line to make things clear. 2. Use "Netflix, Inc." everywhere. 3. Use a single line for the copyright for grep friendliness. 4. Use date ranges in all places for our stuff. Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)
# 116ef4d6	31-Jan-2019	Michael Tuexen <tuexen@FreeBSD.org>	When handling SYN-ACK segments in the SYN-RCVD state, set tp->snd_wnd consistently. This inconsistency was observed when working on the bug reported in PR 235256, although it does not fix the reported issue. The fix for the PR will be a separate commit. PR: 235256 Reviewed by: rrs@, Richard Scheffenegger MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D19033
# bf7fcdb1	27-Jan-2019	Michael Tuexen <tuexen@FreeBSD.org>	Fix the detection of ECN-setup SYN-ACK packets. RFC 3168 defines an ECN-setup SYN-ACK packet as on with the ECE flags set and the CWR flags not set. The code was only checking if ECE flag is set. This patch adds the check to verify that the CWR flags is not set. Submitted by: Richard Scheffenegger Reviewed by: tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18996
# 7dc90a1d	25-Jan-2019	Michael Tuexen <tuexen@FreeBSD.org>	Fix a bug in the restart window computation of TCP New Reno When implementing support for IW10, an update in the computation of the restart window used after an idle phase was missed. To minimize code duplication, implement the logic in tcp_compute_initwnd() and call it. This fixes a bug in NewReno, which was not aware of IW10. Submitted by: Richard Scheffenegger Reviewed by: tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18940
# ad2be389	22-Nov-2018	Michael Tuexen <tuexen@FreeBSD.org>	A TCP stack is required to check SEG.ACK first, when processing a segment in the SYN-SENT state as stated in Section 3.9 of RFC 793, page 66. Ensure this is also done by the TCP RACK stack. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18034
# fef56019	22-Nov-2018	Michael Tuexen <tuexen@FreeBSD.org>	Ensure that the TCP RACK stack honours the setting of the net.inet.tcp.drop_synfin sysctl-variable. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18033
# 7e729f07	22-Nov-2018	Michael Tuexen <tuexen@FreeBSD.org>	Ensure that the default RTT stack can make an RTT measurement if the TCP connection was initiated using the RACK stack, but the peer does not support the TCP RACK extension. This ensures that the TCP behaviour on the wire is the same if the TCP connection is initated using the RACK stack or the default stack. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18032
# 79410718	22-Nov-2018	Michael Tuexen <tuexen@FreeBSD.org>	Ensure that TCP RST-segments announce consistently a receiver window of zero. This was already done when sending them via tcp_respond(). Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D17949
# 3bea9a26	21-Nov-2018	Michael Tuexen <tuexen@FreeBSD.org>	Improve two KASSERTs in the TCP RACK stack. There are two locations where an always true comparison was made in a KASSERT. Replace this by an appropriate check and use a consistent panic message. Also use this code when checking a similar condition. PR: 229664 Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18021
# a8a8a8a8	12-Sep-2018	Michael Tuexen <tuexen@FreeBSD.org>	Fix TCP Fast Open for the TCP RACK stack. * Fix a bug where the SYN handling during established state was applied to a front state. * Move a check for retransmission after the timer handling. This was suppressing timer based retransmissions. * Fix an off-by one byte in the sequence number of retransmissions. * Apply fixes corresponding to https://svnweb.freebsd.org/changeset/base/336934 Reviewed by: rrs@ Approved by: re (kib@) MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16912
# c28440db	19-Aug-2018	Randall Stewart <rrs@FreeBSD.org>	This change represents a substantial restructure of the way we reassembly inbound tcp segments. The old algorithm just blindly dropped in segments without coalescing. This meant that every segment could take up greater and greater room on the linked list of segments. This of course is now subject to a tighter limit (100) of segments which in a high BDP situation will cause us to be a lot more in-efficent as we drop segments beyond 100 entries that we receive. What this restructure does is cause the reassembly buffer to coalesce segments putting an emphasis on the two common cases (which avoid walking the list of segments) i.e. where we add to the back of the queue of segments and where we add to the front. We also have the reassembly buffer supporting a couple of debug options (black box logging as well as counters for code coverage). These are compiled out by default but can be added by uncommenting the defines. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16626
# d18ea344	08-Aug-2018	Randall Stewart <rrs@FreeBSD.org>	Fix a small bug in rack where it will end up sending the FIN twice. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16604
# 936b2b64	06-Aug-2018	Randall Stewart <rrs@FreeBSD.org>	This fixes a bug in Rack where we were not properly using the correct value for Delayed Ack. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16579
# 4ad5b7a0	30-Jul-2018	Randall Stewart <rrs@FreeBSD.org>	This fixes a hole where rack could end up sending an invalid segment into the reassembly queue. This would happen if you enabled the data after close option. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16453
# 8de9ac5e	18-Jul-2018	Randall Stewart <rrs@FreeBSD.org>	Bump the ICMP echo limits to match the RFC Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16333
# a026a53a	10-Jul-2018	Michael Tuexen <tuexen@FreeBSD.org>	Use appropriate MSS value when populating the TCP FO client cookie cache When a client receives a SYN-ACK segment with a TFP fast open cookie, but without an MSS option, an MSS value from uninitialised stack memory is used. This patch ensures that in case no MSS option is included in the SYN-ACK, the appropriate value as given in RFC 7413 is used. Reviewed by: kbowling@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16175
# 5f1347d7	06-Jul-2018	Michael Tuexen <tuexen@FreeBSD.org>	Allow alternate TCP stack to populate the TCP FO client cookie cache. Without this patch, TCP FO could be used when using alternate TCP stack, but only existing entires in the TCP client cookie cache could be used. This cache was not populated by connections using alternate TCP stacks. Sponsored by: Netflix, Inc.
# 6573d758	03-Jul-2018	Matt Macy <mmacy@FreeBSD.org>	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066
# a00f4ac2	23-Jun-2018	Gleb Smirnoff <glebius@FreeBSD.org>	Revert r334843, and partially revert r335180. tcp_outflags[] were defined since 4BSD and are defined nowadays in all its descendants. Removing them breaks third party application.
# c6f76759	19-Jun-2018	Randall Stewart <rrs@FreeBSD.org>	Make sure that the t_peakrate_thr is not compiled in by default until NF can upstream it. Reviewed by: and suggested lstewart Sponsored by: Netflix Inc.
# 9e58ff6f	18-Jun-2018	Matt Macy <mmacy@FreeBSD.org>	convert inpcbinfo hash and info rwlocks to epoch + mutex - Convert inpcbinfo info & hash locks to epoch for read and mutex for write - Garbage collect code that handled INP_INFO_TRY_RLOCK failures as INP_INFO_RLOCK which can no longer fail When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to 3%. Overall packet throughput rate limited by CPU affinity and NIC driver design choices. On the receiver unhalted core cycles samples in in_pcblookup_hash went from 13% to to 1.6% Tested by LLNW and pho@ Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15686
# 9293873e	14-Jun-2018	Gleb Smirnoff <glebius@FreeBSD.org>	TCPOUTFLAGS no longer exists since r334843.
# 4aec110f	13-Jun-2018	Randall Stewart <rrs@FreeBSD.org>	This fixes several bugs that Larry Rosenman helped me find in Rack with respect to its handling of TCP Fast Open. Several fixes all related to TFO are included in this commit: 1) Handling of non-TFO retransmissions 2) Building the proper send-map when we are doing TFO 3) Dealing with the ack that comes back that includes the SYN and data. It appears that with this commit TFO now works :-) Thanks Larry for all your help!! Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15758
# cff21e48	11-Jun-2018	Jonathan T. Looney <jtl@FreeBSD.org>	Change RACK dependency on TCPHPTS from a build-time dependency to a load- time dependency. At present, RACK requires the TCPHPTS option to run. However, because modules can be moved from machine to machine, this dependency is really best assessed at load time rather than at build time. Reviewed by: rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15756
# 89e560f4	07-Jun-2018	Randall Stewart <rrs@FreeBSD.org>	This commit brings in a new refactored TCP stack called Rack. Rack includes the following features: - A different SACK processing scheme (the old sack structures are not used). - RACK (Recent acknowledgment) where counting dup-acks is no longer done instead time is used to knwo when to retransmit. (see the I-D) - TLP (Tail Loss Probe) where we will probe for tail-losses to attempt to try not to take a retransmit time-out. (see the I-D) - Burst mitigation using TCPHTPS - PRR (partial rate reduction) see the RFC. Once built into your kernel, you can select this stack by either socket option with the name of the stack is "rack" or by setting the global sysctl so the default is rack. Note that any connection that does not support SACK will be kicked back to the "default" base FreeBSD stack (currently known as "default"). To build this into your kernel you will need to enable in your kernel: makeoptions WITH_EXTRA_TCP_STACKS=1 options TCPHPTS Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15525