History log of /freebsd-current/sys/netinet/tcp_stacks/rack.c
Revision Date Author Comments
# ea916b64 18-May-2024 Randall Stewart <rrs@FreeBSD.org>

Remove TCP_SAD optional code now that the sack filter performs this function.

With the commit of D44903 we no longer need the SAD option. Instead all stacks that
use the sack filter inherit its protection against sack-attack.

Reviewed by: tuexen@
Differential Revision:https://reviews.freebsd.org/D45216


# 2f923a0c 11-May-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp rack: improve handling of front states

When the RACK stack wants to send a FIN, but still has outstanding
or unsent data, it sends a challenge ack. Don't do this when the
TCP endpoint is still in the front states, since it does not
make sense.
Reviewed by: rrs
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45122


# fce03f85 05-May-2024 Randall Stewart <rrs@FreeBSD.org>

TCP can be subject to Sack Attacks lets fix this issue.

There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like:

ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704)
This first sack looks fine but then the attacker sends

ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705)
ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707)
...
These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries.

To combat this we introduce three things.

when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing.
Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks.
The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway.
The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks).

After this set of changes is in rack can drop its SAD detection completely

Reviewed by:tuexen@, rscheff@
Differential Revision: <https://reviews.freebsd.org/D44903>


# 1941914d 18-Apr-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp rack: improve BBR_LOG_CWND event

Fix a typo, which resulted in missing r_ctl.gate_to_fs in the BBLog
event.

Reported by: Coverity Scan
CID: 1540024
Reviewed by: rrs, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44648


# c9cd686b 18-Apr-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: drop data received after a FIN has been processed

RFC 9293 describes the handling of data in the CLOSE-WAIT, CLOSING,
LAST-ACK, and TIME-WAIT states:
This should not occur since a FIN has been received from the remote
side. Ignore the segment text.
Therefore, implement this handling.

Reviewed by: rrs, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44746


# d902c8f5 06-Apr-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp rack: fix memory corruption

When in rack_output() jumping to the label out, don't write errno into
the log buffer, since the pointer is not initialized.

Reported by: Coverity Scan
CID: 1523773
Reviewed by: rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44647


# 7df0ef5f 05-Apr-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp rack: fix sending

In rack_output(), idle is used as a boolean variable. So don't use it
as an int and don't clear it afterwards.
This avoids setting idle to false, when it is not intended.

Reported by: olivier
Reviewed by: rrs, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44610


# dd7b86e2 18-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove IS_FASTOPEN() macro

The macro is more obfuscating than helping as it just checks a single flag
of t_flags. All other t_flags bits are checked without a macro.

A bigger problem was that declaration of the macro in tcp_var.h depended
on a kernel option. It is a bad practice to create such definitions in
installable headers.

Reviewed by: rscheff, tuexen, kib
Differential Revision: https://reviews.freebsd.org/D44362


# e18b97bd 12-Mar-2024 Randall Stewart <rrs@FreeBSD.org>

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

With a bit of a struggle with git I finally got rack_pcm.c into place (apologies
for not noticing this error). The LINT kernel is running on my box now .. sigh.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986


# c112243f 11-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

Revert "Update to bring the rack stack with all its fixes in."

This commit was incomplete and breaks LINT kernels. The tree has been
broken for 8+ hours.

This reverts commit f6d489f402c320f1a6eaa473491a0b8c3878113e.


# f6d489f4 11-Mar-2024 Randall Stewart <rrs@FreeBSD.org>

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

Note there is a new file that I can't figure out how to get in rack_pcm.c

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986


# 2f4e46df 15-Feb-2024 Michael Tuexen <tuexen@FreeBSD.org>

RACK, BBR: handle EACCES like EPERM for IP output handling

The FreeBSD TCP base stack handles them also the same way.
In case of packet filters dropping packets in the output path,
this avoids retranmitting the dropped packet every 10ms or so.

Reviewed by: rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D43773


# 32a6df57 08-Feb-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: calculate ssthresh on RTO according to RFC5681

per RFC5681, only adjust ssthresh on the initital
retransmission timeout. Since RTO often happens
during loss recovery, while cwnd no longer tracks
all data in flight, calculcate pipe properly.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43768


# f7d5900a 28-Dec-2023 John Baldwin <jhb@FreeBSD.org>

sys: Style fix for M_EXT | M_EXTPG

Add a space around the | operator in places testing for either M_EXT
or M_EXTPG.

Reviewed by: imp, glebius
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D43216


# 7b0b448b 27-Dec-2023 Gordon Bergling <gbe@FreeBSD.org>

tcp_stacks: Fix two typos in a source code comments

- s/recieved/received/

MFC after: 3 days


# d2ef52ef 04-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp/hpts: make stacks responsible for clearing themselves out HPTS

There already is the tfb_tcp_timer_stop_all method that is supposed to stop
all time events associated with a given tcpcb by given stack. Some time
ago it was doing actual callout_stop(). Today bbr/rack just mark their
internal state as inactive in their tfb_tcp_timer_stop_all methods, but
tcpcb stays in HPTS wheel and potentially called in from HPTS. Change the
methods to also call tcp_hpts_remove(). Note: I'm not sure if internal
flag is still relevant once we are out of HPTS wheel.

Call the method when connection goes into TCP_CLOSED state, instead of
calling it later when tcpcb is freed. Also call it when we switch between
stacks.

Reviewed by: tuexen, rrs
Differential Revision: https://reviews.freebsd.org/D42857


# 2b3a7746 04-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

hpts: make stacks responsible for tcp_hpts_init()

Those stacks that use HPTS should care about init, not generic code.

Reviewed by: imp, tuexen, rrs
Differential Revision: https://reviews.freebsd.org/D42856


# b10ae5a9 05-Nov-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp rack: remove references to rb trees

The references should have been removed in
https://cgit.freebsd.org/src/commit/?id=030434acaf4631c4e205f8bccedcc7f845cbfcbf

Reviewed by: rscheff, zlei
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D42386


# 8818f0f1 04-Oct-2023 Randall Stewart <rrs@FreeBSD.org>

TCP: Fix a rack bug that skyzall found which results in a crash.

So when we call the fast_rsm retransmit path, we should always move
snd_nxt back up to snd_max. In fact during ack-processing if snd_nxt
falls behind it should be moved up there as well. Otherwise what
can happen is we have an incorrect mark on snd_nxt and incorrectly
calculate the offset when we go through the front path (which is
what skzyall was able to do) then when we go to clean up the
send the offset is all wrong and we crash.

Special thanks to Gleb for pointing out the problem and the email
that had the reproducer so I could find the issue.

Reported-by: syzbot+f5061a372f74f021ec02@syzkaller.appspotmail.com
Sponsored by: Netflix Inc


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# ab65c64b 26-Jul-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix handling of <RST,ACK> segments in SYN-RCVD for RACK and BBR

This deals with TCP endpoints in the SYN-RCVD state coming from the
SYN-SENT state.

Reviewed by: rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D41203


# 02b885b0 21-Jun-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix TCP MD5 computation for the BBR and RACK stack

PR: 253096
Reviewed by: cc, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D40597


# e022f2b0 09-Jun-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: Rack fixes and misc updates

So over the past few weeks we have found several bugs and updated hybrid pacing to have
more data in the low-level logging. We have also moved more of the BBlogs to "verbose" mode
so that we don't generate a lot of the debug data unless you put verbose/debug on.
There were a couple of notable bugs, one being the incorrect passing of percentage
for reduction to timely and the other the incorrect use of 20% timely Beta instead of
80%. This also expands a simply idea to be able to pace a cwnd (fillcw) as an alternate
pacing mechanism combining that with timely reduction/increase.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D40391


# 43b117f8 06-Jun-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: make the maximum number of retransmissions tunable per VNET

Both Windows (TcpMaxDataRetransmissions) and Linux (tcp_retries2)
allow to restrict the maximum number of consecutive timer based
retransmissions. Add that same capability on a per-VNet basis to
FreeBSD.

Reviewed By: cc, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D40424


# d66540e8 05-Jun-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve sending of TTL/hoplimit and DSCP

Ensure that a user specified value of TTL/hoplimit and DSCP is
used when sending packets.

Reviewed by: cc, rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D40423


# 57a3a161 24-May-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: request tracking is not http specific.

This change is a name change only. TCP Request tracking can track sendfile and even non-sendfile requests. The
names however in the current code use http, and they should not. The feature is not http specific. Lets change the
name so they more properly reflect whats going on. This also fixes conflicts with http_req which caused application pain.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D40229


# c3c20de3 25-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: move HPTS/LRO flags out of inpcb to tcpcb

These flags are TCP specific. While here, make also several LRO
internal functions to pass tcpcb pointer instead of inpcb one.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39698


# c2a69e84 25-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: move HPTS related fields from inpcb to tcpcb

This makes inpcb lighter and allows future cache line optimizations
of tcpcb. The reason why HPTS originally used inpcb is the compressed
TIME-WAIT state (see 0d7445193ab), that used to free a tcpcb, while the
associated connection is still on the HPTS ring.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39697


# 4e8a20a7 19-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: rack the request level logging is a bit too noisy when doing point logging.

When doing request level BB logging the hybrid_bw_log() does not have proper screening to minimize logging
when point level logging is in use. Lets fix it properly so you have to have the proper knobs set to get the
more noisy logging.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39699


# 7a842346 19-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: Rack can crash with the new non-TSO fix..

Turns out the location of the check to see if we can do output is in the wrong place. We need
to jump off to the compressed acks before handling that case since th is NULL in the
compressed ack case which is handled differently anyway.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39690


# 2ad584c5 17-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: Inconsistent use of hpts_calling flag

Gleb has noticed there were some inconsistency's in the way the inp_hpts_calls flag was being used. One
such inconsistency results in a bug when we can't allocate enough sendmap entries to entertain a call to
rack_output().. basically a timer won't get started like it should. Also in cleaning this up I find that the
"no_output" side of input needs to be adjusted to make sure we don't try to re-pace too quickly outside
the hpts assurance of 250useconds.

Another thing here is we end up with duplicate calls to tcp_output() which we should not. If packets go
from hpts for processing the input side of tcp will call the output side of tcp on the last packet if it is needed.
This means that when that occurs a second call to tcp_output would be made that is not needed and if pacing
is going on may be harmful.

Lets fix all this and explicitly state the contract that hpts is making with transports that care about the
flag.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39653


# a540cdca 17-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: use queue(9) STAILQ for the input queue

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39574


# 3cc7b667 14-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: stack unloading crash in rack and bbr

Its possible to induce a crash in either rack or bbr. This would be done
if the rack stack were say the default and bbr was being used by a connection.
If the bbr stack is then unloaded and it was active, we will trigger a MPASS assert
in tcp_hpts since the new stack (default rack) would start a timer, and the old stack
(bbr) would have the inp already in hpts.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39576


# 9903bf34 14-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: rack pacing has some caveats that need to be obeyed when LRO is missing

n further non-LRO testing I found a case where rack is supposed to be waking up but
it is not now. In this special case it sets the flag rc_ack_can_sendout_data. When that is
set we should not prohibit output.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39565


# a2b33c9a 10-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: Rack - in the absence of LRO fixed rate pacing (loopback or interfaces with no LRO) does not work correctly.

Rack is capable of fixed rate or dynamic rate pacing. Both of these can get mixed up when
LRO is not available. This is because LRO will hold off waking up the tcp connection to
processing the inbound packets until the pacing timer is up. Without LRO the pacing only
sort-of works. Sometimes we pace correctly, other times not so much.

This set of changes will make it so pacing works properly in the absence of LRO.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39494


# e2224617 10-Apr-2023 John Baldwin <jhb@FreeBSD.org>

rack: mask and tclass are only used for INET6.

This fixes the LINT-NOINET6 build.


# 66fbc19f 07-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: pass tcpcb in the tfb_tcp_ctloutput() method instead of inpcb

Just matches rest of the KPI.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39435


# 35bc0bcc 07-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: reduce argument list to functions that pass a segment

The socket argument is superfluous, as a tcpcb always has one and
only one socket.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39434


# 945f9a7c 07-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: misc cleanup of options for rack as well as socket option logging.

Both BBR and Rack have the ability to log socket options, which is currently disabled. Rack
has an experimental SaD (Sack Attack Detection) algorithm that should be made available. Also
there is a t_maxpeak_rate that needs to be removed (its un-used).

Reviewed by: tuexen, cc
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39427


# 84b42df8 04-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

rack: fix build on powerpc


# 030434ac 04-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

Update rack to the latest code used at NF.

There have been many changes to rack over the last couple of years, including:
a) Ability when switching stacks to have one stack query another.
b) Internal use of micro-second timers instead of ticks.
c) Many changes to pacing in forms of
1) Improvements to Dynamic Goodput Pacing (DGP)
2) Improvements to fixed rate paciing
3) A new feature called hybrid pacing where the requestor can
get a combination of DGP and fixed rate pacing with deadlines
for delivery that can dynamically speed things up.
d) All kinds of bugs found during extensive testing and use of the
rack stack for streaming video and in fact all data transferred
by NF

Reviewed by: glebius, gallatin, tuexen
Sponsored By: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D39402


# 73ee5756 31-Mar-2023 Randall Stewart <rrs@FreeBSD.org>

Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features.

So stack switching as always been a bit of a issue. We currently use a break before make setup which means that
if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so
that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from
it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).

Once all this lands I will then update rack to begin using all these new features.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210


# 69c7c811 16-Mar-2023 Randall Stewart <rrs@FreeBSD.org>

Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities.

The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints
we need to move to using the new inline functions. This adds them and moves rack to now use
the tcp_tracepoints.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D38831


# c0e4090e 08-Feb-2023 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: Accurately track if ifnet ktls is enabled

This allows us to avoid spurious calls to ktls_disable_ifnet()

When we implemented ifnet kTLSe, we set a flag in the tx socket
buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that
now, or in the past, ifnet ktls was active on a socket. Later,
I added code to switch ifnet ktls sessions to software in the case
of lossy TCP connections that have a high retransmit rate.
Because TCP was using SB_TLS_IFNET to know if it needed to do math
to calculate the retransmit ratio and potentially call into
ktls_disable_ifnet(), it was doing unneeded work long after
a session was moved to software.

This patch carefully tracks whether or not ifnet ktls is still enabled
on a TCP connection. Because the inp is now embedded in the tcpcb, and
because TCP is the most frequent accessor of this state, it made sense to
move this from the socket buffer flags to the tcpcb. Because we now need
reliable access to the tcbcb, we take a ref on the inp when creating a tx
ktls session.

While here, I noticed that rack/bbr were incorrectly implementing
tfb_hwtls_change(), and applying the change to all pending sends,
when it should apply only to future sends.

This change reduces spurious calls to ktls_disable_ifnet() by 95% or so
in a Netflix CDN environment.

Reviewed by: markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D38380


# 18b83b62 26-Jan-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: reduce the size of t_rttupdated in tcpcb

During tcp session start, various mechanisms need to
track a few initial RTTs before becoming active.
Prevent overflows of the corresponding tracking counter
and reduce the size of tcpcb simultaneously.

Reviewed By: #transport, tuexen, guest-ccui
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D21117


# 73e994a9 19-Jan-2023 Gordon Bergling <gbe@FreeBSD.org>

extra_tcp_stacks: Fix a common typo in source code comments

- s/orginal/original/

MFC after: 3 days


# 432a398d 10-Jan-2023 Gordon Bergling <gbe@FreeBSD.org>

tcp_rack(4): Fix a typo in a source code comment

- s/postion/position/

MFC after: 3 days


# 8ea41829 10-Jan-2023 Andrew Gallatin <gallatin@FreeBSD.org>

tcp: Build RACK and BBR stacks as a part of LINT

When RACK and BBR were added to the kernel, they were put
behind 'WITH_EXTRA_TCP_STACKS=1'. Unfortunately that was
never added to any NOTES file, so RACK & BBR were not compiled
with the various LINT-NOINET, LINT-NOINET6, and LINT-NOIP kernels.
This lead to the stacks sometimes being broken.

This change:

- Fixes RACK so that it compiles with the various LINT-NO* kernels
- Adds WITH_EXTRA_TCP_STACKS=1 to all NOTES kernels so that
RACK and BBR are compile tested regularly

Sponsored by: Netflix
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D37903


# 2e2a1c31 14-Dec-2022 Randall Stewart <rrs@FreeBSD.org>

Opps take out a stray left behind printf that was
for debugging.. Sorry.


# e2e088ae 14-Dec-2022 Randall Stewart <rrs@FreeBSD.org>

Rack cannot be loaded without cc_newreno compiled into the kernel.

Right now rack will fail to load due to its hack in accessing symbol names
in cc_newreno. This was fine when newreno was always compiled into the
kernel but now ... not so much. Instead lets fix up rack to use the socket
option queries to get the information it wants and set the parameters. We
also fix the CC parameter so they are always settable.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D37622


# eaabc937 14-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: retire TCPDEBUG

This subsystem is superseded by modern debugging facilities,
e.g. DTrace probes and TCP black box logging.

We intentionally leave SO_DEBUG in place, as many utilities may
set it on a socket. Also the tcp::debug DTrace probes look at
this flag on a socket.

Reviewed by: gnn, tuexen
Discussed with: rscheff, rrs, jtl
Differential revision: https://reviews.freebsd.org/D37694


# e6fc01f6 14-Dec-2022 Mateusz Guzik <mjg@FreeBSD.org>

tcp: whack the stale declaration of rack_timer_stop

Sponsored by: Rubicon Communications, LLC ("Netgate")


# 446ccdd0 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: use single locked callout per tcpcb for the TCP timers

Use only one callout structure per tcpcb that is responsible for handling
all five TCP timeouts. Use locked version of callout, of course. The
callout function tcp_timer_enter() chooses soonest timer and executes it
with lock held. Unless the timer reports that the tcpcb has been freed,
the callout is rescheduled for next soonest timer, if there is any.

With single callout per tcpcb on connection teardown we should be able
to fully stop the callout and immediately free it, avoiding use of
callout_async_drain(). There is one gotcha here: callout_stop() can
actually touch our memory when a rare race condition happens. See
comment above tcp_timer_stop(). Synchronous stop of the callout makes
tcp_discardcb() the single entry point for tcpcb destructor, merging the
tcp_freecb() to the end of the function.

While here, also remove lots of lingering checks in the beginning of
TCP timer functions. With a locked callout they are unnecessary.

While here, clean unused parts of timer KPI for the pluggable TCP stacks.

While here, remove TCPDEBUG from tcp_timer.c, as this allows for more
simplification of TCP timers. The TCPDEBUG is scheduled for removal.

Move the DTrace probes in timers to the beginning of a function, where
a tcpcb is always existing.

Discussed with: rrs, tuexen, rscheff (the TCP part of the diff)
Reviewed by: hselasky, kib, mav (the callout part)
Differential revision: https://reviews.freebsd.org/D37321


# 918fa422 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove tcp_timer_suspend()

It was a temporary code added together with RACK to fight against
TCP timer races.


# e68b3792 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: embed inpcb into tcpcb

For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.

The most import one is the inpcb itself. With embedding we can provide
strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa. Also we reduce number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such move.

The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and temprorary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.

One interesting side effect is that now TCP data is allocated from
SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.

Large part of the change was done with sed script:

s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g

Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, that added several <netinet/in_pcb.h>.

Differential revision: https://reviews.freebsd.org/D37127


# bd4f9866 16-Nov-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: remove unused t_rttbest

No functional change intended.

Reviewed by: rscheff@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D37401


# 9eb0e832 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: provide macros to access inpcb and socket from a tcpcb

There should be no functional changes with this commit.

Reviewed by: rscheff
Differential revision: https://reviews.freebsd.org/D37123


# b1258b76 06-Nov-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: add conservative d.cep accounting algorithm

Accurate ECN asks to conservatively estimate, when the
ACE counter may have wrapped due to a single ACK covering a larger
number of segments. This is described in Annex A.2 of the
accurate-ecn draft.

Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37281


# f504685a 31-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

rack/bbr: put back assertion that connection is not in TIME-WAIT

The assertion was incorrectly removed in 0d7445193ab. The leak of
a TIME-WAIT state into tfb_do_segment_nounlock method was fixed in
31bc602ff81. The TIME-WAIT connections are processed by the main
tcp_input() always.


# 83c1ec92 20-Oct-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: ECN preparations for ECN++, AccECN (tcp_respond)

tcp_respond is another function to build a tcp control packet
quickly. With ECN++ and AccECN, both the IP ECN header, and
the TCP ECN flags are supposed to reflect the correct state.

Also ensure that on receiving multiple ECN SYN-ACKs, the
responses triggered will reflect the latest state.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36973


# 53af6903 06-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove INP_TIMEWAIT flag

Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193ab, this commit shall not cause any functional changes.

Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.

Differential revision: https://reviews.freebsd.org/D36400


# 0d744519 06-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove tcptw, the compressed timewait state structure

The memory savings the tcptw brought back in 2003 (see 340c35de6a2) no
longer justify the complexity required to maintain it. For longer
explanation please check out the email [1].

Surpisingly through almost 20 years the TCP stack functionality of
handling the TIME_WAIT state with a normal tcpcb did not bitrot. The
existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state,
which is confirmed by the packetdrill tcp-testsuite [2].

This change just removes tcptw and leaves INP_TIMEWAIT. The flag will
be removed in a separate commit. This makes it easier to review and
possibly debug the changes.

[1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html
[2] https://github.com/freebsd-net/tcp-testsuite

Differential revision: https://reviews.freebsd.org/D36398


# 2220b66f 03-Oct-2022 Konstantin Belousov <kib@FreeBSD.org>

Add mbuf_tstmp2timeval()

Reviewed by: hselasky, jkim, rscheff
Sponsored by: NVIDIA networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D36870


# e5049a17 26-Sep-2022 Randall Stewart <rrs@FreeBSD.org>

TCP rack does not work properly with cubic.

Right now if you use rack with cubic (the new default cc) you will have
improper results. This is because rack uses different variables than
the base stack (or bbr) and thus tcp_compute_pipe() always returns
so that cubic will choose a 30% backoff not the 50% backoff it should
when it is newreno compatibility mode. The fix is to allow a stack (rack)
to override its own compute_pipe.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36711


# 493105c2 21-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: fix simultaneous open and refine e80062a2d43

- The soisconnected() call on transition from SYN_RCVD to ESTABLISHED
is also necessary for a half-synchronized connection. Fix that
just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED.
- Provide a comment that explains at what conditions the call to
soisconnected() is necessary.
- Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN.
- Extend the change to the BBR and RACK stacks.

Note: the interaction between the accept_filter(9) and the socket layer
is not fully consistent, yet. For most accept filters this call to
soisconnected() will not move the connection from the incomplete queue
to the complete. The move would happen only when the filter has received
the desired data, and soisconnected() would be called once again from
sorwakeup(). Ideally, we should mark socket as connected only there,
and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the
simultaneous open case. However, this doesn't yet work.

Reviewed by: rscheff, tuexen, rrs
Differential revision: https://reviews.freebsd.org/D36641


# 81560c55 09-Sep-2022 Randall Stewart <rrs@FreeBSD.org>

TCP: Rack ends up sending all that is outstanding every timeout.

In doing some testing for a different problem, I have found rack retransmitting
all outstanding data every time a timeout occurs. The outstanding is sent 1ms
apart between each packet, and then the timeout runs off again. This causes
extra retransmissions when we should be waiting for an ack after sending the
very first segment.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36494


# fa52f9dc 03-Sep-2022 Gordon Bergling <gbe@FreeBSD.org>

tcp_rack: Fix two typos in source code comments

- s/overriden/overridden/

MFC after: 3 days


# 4012ef77 31-Aug-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Functional implementation of Accurate ECN

The AccECN handshake and TCP header flags are supported,
no support yet for the AccECN option. This minimalistic
implementation is sufficient to support DCTCP while
dramatically cutting the number of ACKs, and provide ECN
response from the receiver to the CC modules.

Reviewed By: #transport, #manpages, rrs, pauamma
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D21011


# 62ce18fc 23-Aug-2022 Randall Stewart <rrs@FreeBSD.org>

tcp: Rack rwnd collapse.

Currently when the peer collapses its rwnd, we mark packets to be retransmitted
and use the must_retran flags like we do when a PMTU collapses to retransmit the
collapsed packets. However this causes a problem with some middle boxes that
play with the rwnd to control flow. As soon as the rwnd increases we start resending
which may be not even a rtt.. and in fact the peer may have gotten the packets. Which
means we gratuitously retransmit packets we should not.

The fix here is to make sure that a rack time has passed before retransmitting the packets.
This makes sure that the rwnd collapse was real and the packets do need retransmission.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D35166


# 57cdd13d 14-Aug-2022 Dimitry Andric <dim@FreeBSD.org>

Suppress unused variable warning in tcp_stacks's rack.c

With clang 15, the following -Werror warning is produced:

sys/netinet/tcp_stacks/rack.c:17405:12: error: variable 'outstanding' set but not used [-Werror,-Wunused-but-set-variable]
uint32_t outstanding;
^

The 'outstanding' variable was used later in the rack_output() function,
but refactoring in 35c7bb340788f removed the usage. To avoid too much
code churn, mark the variable unused to supress the warning.

MFC after: 3 days


# e967183c 14-Aug-2022 Dimitry Andric <dim@FreeBSD.org>

Fix unused variable warning in tcp_stacks's rack.c

With clang 15, the following -Werror warning is produced:

sys/netinet/tcp_stacks/rack.c:16148:6: error: variable 'cnt_thru' set but not used [-Werror,-Wunused-but-set-variable]
int cnt_thru = 1;
^

The 'cnt_thru' variable is only used when TCP_ACCOUNTING is defined.
Ensure it is only declared and set in that case.

MFC after: 3 days


# 1abc27dd 01-Aug-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp rack: simplify computation of rsm start and end

While there, also fix the setting of the SYN related flag.

Reviewed by: rrs
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D35862


# 5b741298 19-Jul-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp rack: fix switching to RACK when FIN has been sent

Fix the rack sendmap entry in case a FIN has been sent when the
stack is switched over to RACK.

Reported by: syzbot+dd55e316428419e9354b@syzkaller.appspotmail.com
Reviewed by: rrs
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D35731


# 74703901 04-Jul-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: use a TCP flag to check if connection has been close(2)d

The flag SS_NOFDREF is a private flag of the socket layer. It also
is supposed to be read with SOCK_LOCK(), which we don't own here.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D35663


# 32a01b2b 04-Jun-2022 Gordon Bergling <gbe@FreeBSD.org>

rack: Fix a common typo in comments and a sysctl description

- s/multipler/multiplier/

MFC after: 3 days


# c93db892 04-Jun-2022 Gordon Bergling <gbe@FreeBSD.org>

rack: Fix a typo in a source code comment

- s/enought/enough/

MFC after: 3 days


# 21b923c3 04-Jun-2022 Gordon Bergling <gbe@FreeBSD.org>

tcp_rack: Fix two typos in sysctl descriptions

- s/higest/highest/

MFC after: 3 days


# 4581cffb 12-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: fix build, convert missed sbreserve_locked() calls

Fixes: 4328318445ae


# 04831efd 10-May-2022 Randall Stewart <rrs@FreeBSD.org>

tcp: Rack idle reduce not working.

Rack converted to micro-seconds quite some time ago, but in testing
we have found a miss in that work. The idle reduce time is still based
in ticks, so it must be converted to microseconds before any comparisons
else you will likely not do idle reduce.

Reviewed by: tuexen, thj
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D35066


# 6edfc10c 14-Apr-2022 Randall Stewart <rrs@FreeBSD.org>

tcp: adding a functionality to define "trace points" so that BB logging can be enabled at specific events.

This commit will add a new concept to rack, tracepoints. A tracepoint
is a defined point inserted into the code (3 are included in this initial patch) that
allows a developer to insert a point that might be of interest. The developer numbers
the point in the tcp_rack.h file and then can use sysctl to enable that (or all) trace
points. A limit is also given to how many BB logged connections will turn on
so that a box is not overrun by BB logging.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D34898


# 86fa80f3 12-Apr-2022 John Baldwin <jhb@FreeBSD.org>

rack: Remove unused variable.


# addb2c65 09-Apr-2022 Gordon Bergling <gbe@FreeBSD.org>

tcp_rack: Fix a typo in a source code comment

- s/possiblity/possibility/

MFC after: 3 days


# 36814092 09-Apr-2022 Gordon Bergling <gbe@FreeBSD.org>

tcp_rack: Fix a few typos in sysctl descriptions and comments

- s/postion/position/
- s/postions/positions/
- s/repostion/reposition/

MFC after: 5 days


# ee1a08b8 01-Apr-2022 Randall Stewart <rrs@FreeBSD.org>

rack may end up with a stuck connectin fi the rwnd is colapsed on sent data.

There is a case where rack will get stuck when it has outstanding data and
the peer collapses the rwnd down to 0. This leaves the session hung if
the rwnd update is not received. You can test this with the packet drill script
below. Without this fix it will be stuck and hang. With it we retransmit everything.
This also fixes the mtu retransmit case so we don't go into recovery when
the mtu is changed to a smaller value.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D34573


# 75fdc440 07-Feb-2022 Gordon Bergling <gbe@FreeBSD.org>

extra_tcp_stacks: Fix two typos in source code comments

- s/recusive/recursive/

MFC after: 3 days


# 2ff07d92 25-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Restore correct ECT marking behavior on SACK retransmissions

While coalescing all ECN-related code into new common source files,
the flag to deal with SACK retransmissions was skipped. This leads
to non-compliant ECT-marking of SACK retransmissions, as well as
the premature sending of other TCP ECN flags (CWR).

Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34376


# a43b0aca 23-Feb-2022 Randall Stewart <rrs@FreeBSD.org>

tcp: Push bit failure to set in fastpath

Recently changes were made to the tcp stack to use a macro/function
to set tcp flags. In the process the PUSH bit setting in the fastpath of
rack was broken. This fixes that as well as cleans up a warning that
is occurring when you don't have INVARIANT on (inp used in KASSERT).

We can use the tcp test suite to find this bug the test plan shows the script
that fails due to the missing push bit

Reviewed by: rscheff, tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D34332


# cc41c174 09-Feb-2022 Randall Stewart <rrs@FreeBSD.org>

opps my patch lost the removal of the tlp_threshold counter increments


# 8d64b4b4 09-Feb-2022 Randall Stewart <rrs@FreeBSD.org>

cleanup of rack variables.

During a recent deep dive into all the variables so I could
discover why stack switching caused larger retransmits I examined
every variable in rack. In the process I found quite a few bits
that were not used and needed cleanup. This update pulls
out all the unused pieces from rack. Note there are *no* functional
changes here, just the removal of unused variables and a bit of
spacing clean up.

Reviewed by: Michael Tuexen, Richard Scheffenegger
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D34205


# ab001fcd 06-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Apply tcp flags after ECN processing in rack_fast_output()

Missed to move the tcp_set_flags() past ECN processing
in rack_fast_output() earlier.

Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34180


# a9696510 07-Feb-2022 Randall Stewart <rrs@FreeBSD.org>

tcp: Add hystart++ to our cubic implementation.

As promised to the transport call on 11/4/22 here is an implementation
of hystart++ for cubic. It also cleans up the tcp_congestion function
to have a better name. Common variables are moved into the general
cc.h structure so that both cubic and newreno can use them for
hystart++

Reviewed by: Michael Tuexen, Richard Scheffenegger
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D33035


# f7220c48 05-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: move ECN handling code to a common file

Reduce the burden to maintain correct and
extensible ECN related code across multiple
stacks and codepaths.

Formally no functional change.

Incidentially this establishes correct
ECN operation in one instance.

Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34162


# 7994ef3c 04-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

Revert "tcp: move ECN handling code to a common file"

This reverts commit 0c424c90eaa6602e07bca7836b1d178b91f2a88a.


# 0c424c90 04-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: move ECN handling code to a common file

Reduce the burden to maintain correct and
extensible ECN related code across multiple
stacks and codepaths.

Formally no functional change.

Incidentially this establishes correct
ECN operation in one instance.

Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34162


# fd723975 03-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: fix typo in commit f026275e26d0071ac3dee98526e8b9bcad58f0fa

missed one bitmask inversion while committing D34148

Differential Revision: https://reviews.freebsd.org/D34148
Differential Revision: https://reviews.freebsd.org/D34160


# f026275e 03-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: set IP ECN header codepoint properly

TCP RACK can cache the IP header while preparing
a new TCP packet for transmission. Thus all the
IP ECN codepoint bits need to be assigned, without
assuming a clear field beforehand.

Reviewed By: tuexen, kbowling, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34148


# 1ebf4607 03-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Access all 12 TCP header flags via inline function

In order to consistently provide access to all
(including reserved) TCP header flag bits,
use an accessor function tcp_get_flags and
tcp_set_flags. Also expand any flag variable from
uint8_t / char to uint16_t.

Reviewed By: hselasky, tuexen, glebius, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34130


# d51c8035 02-Feb-2022 Michael Tuexen <tuexen@FreeBSD.org>

rack: fix compilation and small cleanup

Fix a function prototype missed in the last commit and whitespace
change.
Sponsored by: Netflix, Inc.


# 3b3c08c1 02-Feb-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: cleanup functions related to socket option handling

Consistently only pass the inp and the sopt around. Don't pass the
so around, since in a upcoming commit tcp_ctloutput_set() will be
called from a context different from setsockopt(). Also expect
the inp to be locked when calling tcp_ctloutput_[gs]et(), this is
also required for the upcoming use by tcpsso, a command line tool
to set socket options.
Reviewed by: glebius, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D34151


# 9e58cca3 26-Jan-2022 Gordon Bergling <gbe@FreeBSD.org>

extra_tcp_stacks: Fix two typos in source code comments

- s/differnt/different/

MFC after; 3 days


# b3df222e 26-Jan-2022 Gordon Bergling <gbe@FreeBSD.org>

extra_tcp_stacks: Fix a few common typos

TCP_BBR:
- Fix a typo introducted in 1b90dfa5d2b0, which was reported by tuexen@

TCP_RACK:
- Correct two sysctl descriptions: s/corret/correct/

tcp_bbr(4): Also fix s/measurment/measurement/ in the man page

MFC after: 1 week


# aac52f94 18-Jan-2022 Randall Stewart <rrs@FreeBSD.org>

tcp: Warning cleanup from new compiler.

The clang compiler recently got an update that generates warnings of unused
variables where they were set, and then never used. This revision goes through
the tcp stack and cleans all of those up.

Reviewed by: Michael Tuexen, Gleb Smirnoff
Sponsored by: Netflix Inc.
Differential Revision:


# a370832b 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove delayed drop KPI

No longer needed after tcp_output() can ask caller to drop.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33371


# f64dc2ab 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: TCP output method can request tcp_drop

The advanced TCP stacks (bbr, rack) may decide to drop a TCP connection
when they do output on it. The default stack never does this, thus
existing framework expects tcp_output() always to return locked and
valid tcpcb.

Provide KPI extension to satisfy demands of advanced stacks. If the
output method returns negative error code, it means that caller must
call tcp_drop().

In tcp_var() provide three inline methods to call tcp_output():
- tcp_output() is a drop-in replacement for the default stack, so that
default stack can continue using it internally without modifications.
For advanced stacks it would perform tcp_drop() and unlock and report
that with negative error code.
- tcp_output_unlock() handles the negative code and always converts
it to positive and always unlocks.
- tcp_output_nodrop() just calls the method and leaves the responsibility
to drop on the caller.

Sweep over the advanced stacks and use new KPI instead of using HPTS
delayed drop queue for that.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33370


# dbbcc777 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

rack: rack_do_compressed_ack_processing() can call tcp_drop()

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33369


# 66aeb0b5 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

rack: drop connection synchronously, when we can

For all functions that are leaves of tcp_input() call
ctf_do_dropwithreset_conn() instead of ctf_do_dropwithreset(), cause
we always got tp and we want it to be dropped.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33368


# 40fa3e40 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: mechanically substitute call to tfb_tcp_output to new method.

Made with sed(1) execution:

sed -Ef sed -i "" $(grep --exclude tcp_var.h -lr tcp_output sys/)

sed:
s/tp->t_fb->tfb_tcp_output\(tp\)/tcp_output(tp)/
s/to tfb_tcp_output\(\)/to tcp_output()/

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33366


# 9b602965 15-Dec-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Rack in a rare case we can get stuck sending a very small amount.

If a tlp sending new data fails, and then the peer starts
talking to us again, we can be in a situation where the
tlp_new_data count is set, we are not in recovery and
we always send one packet every RTT. The failure
has to occur when we send the TLP initially from the ip_output()
which is rare. But if it occurs you are basically stuck.

This fixes it so we use the new_data count and clear it so
we know it will be cleared. If a failure occurs the tlp timer
will regenerate a new amount anyway so it is un-needed to
carry the value on.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D33325


# dadbc042 06-Dec-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: rack fails to send out a TLP after a MTU change

When rack sends out a TLP it sets up various state to make sure
it avoids the cwnd (its been more than 1 RTT since our last send) and
it may at times send new data. If an MTU change as occurred
and our cwnd has collapsed we can have a situation where must_retran
flag is set and we obey the cwnd thus never sending the TLP and then
sitting stuck.

This one line fix addresses that problem
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D33231


# db0ac6de 02-Dec-2021 Cy Schubert <cy@FreeBSD.org>

Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816"

This reverts commit 266f97b5e9a7958e365e78288616a459b40d924a, reversing
changes made to a10253cffea84c0c980a36ba6776b00ed96c3e3b.

A mismerge of a merge to catch up to main resulted in files being
committed which should not have been.


# f971e791 02-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: rename input queue to drop queue and trim dead code

The HPTS input queue is in reality used only for "delayed drops".
When a TCP stack decides to drop a connection on the output path
it can't do that due to locking protocol between main tcp_output()
and stacks. So, rack/bbr utilize HPTS to drop the connection in
a different context.

In the past the queue could also process input packets in context
of HPTS thread, but now no stack uses this, so remove this
functionality.

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33025


# 50f081ec 02-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: provide tcp_in_hpts().

It will hide some internal HPTS knowledge from the consumers.

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33023


# 1dadeab3 30-Nov-2021 Gordon Bergling <gbe@FreeBSD.org>

netinet: Fix a common typo in source code comments

- s/segement/segment/

MFC after: 3 days


# 97e28f0f 17-Nov-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Rack ack war with a mis-behaving firewall or nat with resets.

Previously we added ack-war prevention for misbehaving firewalls. This is
where the f/w or nat messes up its sequence numbers and causes an ack-war.
There is yet another type of ack war that we have found in the wild that is
like unto this. Basically the f/w or nat gets a ack (keep-alive probe or such)
and instead of turning the ack/seq around and adding a TH_RST it does something
real stupid and sends a new packet with seq=0. This of course triggers the challenge
ack in the reset processing which then sends in a challenge ack (if the seq=0 is within
the range of possible sequence numbers allowed by the challenge) and then we rinse-repeat.

This will add the needed tweaks (similar to the last ack-war prevention using the same sysctls and counters)
to prevent it and allow say 5 per second by default.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32938


# 26cbd002 11-Nov-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Rack may still calculate long RTT on persists probes.

When a persists probe is lost, we will end up calculating a long
RTT based on the initial probe and when the response comes from the
second probe (or third etc). This means we have a minimum of a
confidence level of 3 on a incorrect probe. This commit will change it
so that we have one of two options
a) Just not count RTT of probes where we had a loss
<or>
b) Count them still but degrade the confidence to 0.

I have set in this the default being to just not measure them, but I am open
to having the default be otherwise.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32897


# 477aeb3d 08-Nov-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Printf should be removed.

There is a printf when a socket option down to the CC module fails, this really
should not be a printf. In fact this whole option needs to be re-thought in coordination
with some other changes in the CC modules (its just not right but its ok what it
does here if it fails since it will just use the ECN beta).

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32894


# c28e39c3 03-Nov-2021 Gordon Bergling <gbe@FreeBSD.org>

Fix a common typo in syctl descriptions

- s/maxiumum/maximum/

MFC after: 3 days


# 141a53cd 29-Oct-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Rack might retransmit forever.

If we get a Sacked peer with an MTU change we can retransmit forever if the
last bytes are sacked and the client goes away (think power off). Then we
never see the end condition and continually retransmit.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32671


# aeda8527 29-Oct-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Rack at times can miscalculate the RTT from what it thinks is a persists probe respone.

Turns out that if a peer sends in a window update right after rack fires off
a persists probe, we can mis-interpret the window update and calculate
a bogus RTT (very short). We still process the window update and send
the data but we incorrectly generate an RTT. We should be only doing
the RTT stuff if the rwnd is still small and has not changed.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32717


# 5d3bf5b1 25-Oct-2021 Gleb Smirnoff <glebius@FreeBSD.org>

rack: Update the fast send block on setsockopt(2)

Rack caches TCP/IP header for fast send, so it doesn't call
tcpip_fillheaders(). After certain socket option changes,
namely IPV6_TCLASS, IP_TOS and IP_TTL it needs to update
its fast block to be in sync with the inpcb.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D32655


# f581a26e 25-Oct-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP.

Pass control for IP/IP6 level options from generic tcp_ctloutput_set()
down to per-stack ctloutput.

Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput().

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D32655


# 12752978 26-Oct-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: The rack stack can incorrectly have an overflow when calculating a burst delay.

If the congestion window is very large the fact that we multiply it by 1000 (for microseconds) can
cause the uint32_t to overflow and we incorrectly calculate a very small divisor. This will then
cause the burst timer to be very large when it should be 0. Instead lets make the three variables
uint64_t and avoid the issue.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32668


# 4e4c84f8 22-Oct-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Add hystart-plus to cc_newreno and rack.

TCP Hystart draft version -03:
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-hystartplusplus

Is a new version of hystart that allows one to carefully exit slow start if the RTT
spikes too much. The newer version has a slower-slow-start so to speak that then
kicks in for five round trips. To see if you exited too early, if not into congestion avoidance.
This commit will add that feature to our newreno CC and add the needed bits in rack to
be able to enable it.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32373


# a36230f7 01-Oct-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Make dsack stats available in netstat and also make sure its aware of TLP's.

DSACK accounting has been for quite some time under a NETFLIX_STATS ifdef. Statistics
on DSACKs however are very useful in figuring out how much bad retransmissions you
are doing. This is further complicated, however, by stacks that do TLP. A TLP
when discovering a lost ack in the reverse path will cause the generation
of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well
as the more traditional dsack-bytes and dsack-packets. These will now
all display in netstat -p tcp -s. This also updates all stacks that
are currently built to keep track of these stats.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32158


# 1ca931a5 23-Sep-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Rack compressed ack path updates the recv window too easily

The compressed ack path of rack is not following proper procedures in updating
the peers window. It should be checking the seq and ack values before updating and
instead it is blindly updating the values. This could in theory get the wrong window
in the connection for some length of time.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32082


# fd69939e 23-Sep-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Two bugs in rack one of which can lead to a panic.

In extensive testing in NF we have found two issues inside
the rack stack.

1) An incorrect offset is being generated by the fast send path when a fast send is initiated on
the end of the socket buffer and before the fast send runs, the sb_compress macro adds data to the trailing socket.
This fools the fast send code into thinking the sb offset changed and it miscalculates a "updated offset".
It should only do that when the mbuf in question got smaller.. i.e. an ack was processed. This can lead to
a panic deref'ing a NULL mbuf if that packet is ever retransmitted. At the best case it leads to invalid data being
sent to the client which usually terminates the connection. The fix is to have the proper logic (that is in the rsm fast path)
to make sure we only update the offset when the mbuf shrinks.
2) The other issue is more bothersome. The timestamp check in rack needs to use the msec timestamp when
comparing the timestamp echo to now. It was using a microsecond timestamp which ends up giving error
prone results but causes only small harm in trying to identify which send to use in RTT calculations if its a retransmit.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32062


# 5baf32c9 17-Aug-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Add support for DSACK based reordering window to rack.

The rack stack, with respect to the rack bits in it, was originally built based
on an early I-D of rack. In fact at that time the TLP bits were in a separate
I-D. The dynamic reordering window based on DSACK events was not present
in rack at that time. It is now part of the RFC and we need to update our stack
to include these features. However we want to have a way to control the feature
so that we can, if the admin decides, make it stay the same way system wide as
well as via socket option. The new sysctl and socket option has the following
meaning for setting:

00 (0) - Keep the old way, i.e. reordering window is 1 and do not use DSACK bytes to add to reorder window
01 (1) - Change the Reordering window to 1/4 of an RTT but do not use DSACK bytes to add to reorder window
10 (2) - Keep the reordering window as 1, but do use SACK bytes to add additional 1/4 RTT delay to the reorder window
11 (3) - reordering window is 1/4 of an RTT and add additional DSACK bytes to increase the reordering window (RFC behavior)

The default currently in the sysctl is 3 so we get standards based behavior.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31506


# b1e806c0 07-Jul-2021 Andrew Gallatin <gallatin@FreeBSD.org>

tcp: fix alternate stack build with LINT-NO{INET,INET6,IP}

When fixing another bug, I noticed that the alternate
TCP stacks do not build when various combinations of
ipv4 and ipv6 are disabled.

Reviewed by: rrs, tuexen
Differential Revision: https://reviews.freebsd.org/D31094
Sponsored by: Netflix


# d7955cc0 06-Jul-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: HPTS performance enhancements

HPTS drives both rack and bbr, and yet there have been many complaints
about performance. This bit of work restructures hpts to help reduce CPU
overhead. It does this by now instead of relying on the timer/callout to
drive it instead use user return from a system call as well as lro flushes
to drive hpts. The timer becomes a backstop that dynamically adjusts
based on how "late" we are.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31083


# e834f9a4 06-Jul-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Address goodput and TLP edge cases.

There are several cases where we make a goodput measurement and we are running
out of data when we decide to make the measurement. In reality we should not make
such a measurement if there is no chance we can have "enough" data. There is also
some corner case TLP's that end up not registering as a TLP like they should, we
fix this by pushing the doing_tlp setup to the actual timeout that knows it did
a TLP. This makes it so we always have the appropriate flag on the sendmap
indicating a TLP being done as well as count correctly so we make no more
that two TLP's.

In addressing the goodput lets also add a "quality" metric that can be viewed via
blackbox logs so that a casual observer does not have to figure out how good
of a measurement it is. This is needed due to the fact that we may still make
a measurement that is of a poorer quality as we run out of data but still have
a minimal amount of data to make a measurement.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31076


# 9e4d9e4c 25-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Preparation for allowing hardware TLS to be able to kick a tcp connection that is retransmitting too much out of hardware and back to software.

Hardware TLS is now supported in some interface cards and it works well. Except that
when we have connections that retransmit a lot we get into trouble with all the retransmits.
This prep step makes way for change that Drew will be making so that we can "kick out" a
session from hardware TLS.

Reviewed by: mtuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30895


# 66aec14a 24-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Rack not being very friendly with V6:4 socket and having a connection from V4

There were two bugs that prevented V4 sockets from connecting to
a rack server running a V4/V6 socket. As well as a bug that stops the
mapped v4 in V6 address from working.

Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30885


# f1536bb5 11-Jun-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: remove debug output from RACK

Reported by: iron.udjin@gmail.com, Marek Zarychta
Reviewed by: rrs
PR: 256538
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D30723
Sponsored by: Netflix, Inc.


# ba1b3e48 11-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Missing mfree in rack and bbr

Recently (Nov) we added logic that protects against a peer negotiating a timestamp, and
then not including a timestamp. This involved in the input path doing a goto done_with_input
label. Now I suspect the code was cribbed from one in Rack that has to do with the SYN.
This had a bug, i.e. it should have a m_freem(m) before going to the label (bbr had this
missing m_freem() but rack did not). This then caused the missing m_freem to show
up in both BBR and Rack. Also looking at the code referencing m->m_pkthdr.lro_nsegs
later (after processing) is not a good idea, even though its only for logging. Best to
copy that off before any frees can take place.

Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30727


# 224cf7b3 11-Jun-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix compilation of IPv4-only builds

PR: 256538
Reported by: iron.udjin@gmail.com
MFC after: 3 days
Sponsored by: Netflix, Inc.


# 67e89281 10-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Mbuf leak while holding a socket buffer lock.

When running at NF the current Rack and BBR changes with the recent
commits from Richard that cause the socket buffer lock to be held over
the ip_output() call and then finally culminating in a call to tcp_handle_wakeup()
we get a lot of leaked mbufs. I don't think that this leak is actually caused
by holding the lock or what Richard has done, but is exposing some other
bug that has probably been lying dormant for a long time. I will continue to
look (using his changes) at what is going on to try to root cause out the issue.

In the meantime I can't leave the leaks out for everyone else. So this commit
will revert all of Richards changes and move both Rack and BBR back to just
doing the old sorwakeup_locked() calls after messing with the so_rcv buffer.

We may want to look at adding back in Richards changes after I have pinpointed
the root cause of the mbuf leak and fixed it.

Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30704


# 4747500d 04-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: A better fix for the previously attempted fix of the ack-war issue with tcp.

So it turns out that my fix before was not correct. It ended with us failing
some of the "improved" SYN tests, since we are not in the correct states.
With more digging I have figured out the root of the problem is that when
we receive a SYN|FIN the reassembly code made it so we create a segq entry
to hold the FIN. In the established state where we were not in order this
would be correct i.e. a 0 len with a FIN would need to be accepted. But
if you are in a front state we need to strip the FIN so we correctly handle
the ACK but ignore the FIN. This gets us into the proper states
and avoids the previous ack war.

I back out some of the previous changes but then add a new change
here in tcp_reass() that fixes the root cause of the issue. We still
leave the rack panic fixes in place however.

Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30627


# 8c69d988 27-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: When we have an out-of-order FIN we do want to strip off the FIN bit.

The last set of commits fixed both a panic (in rack) and an ACK-war (in freebsd and bbr).
However there was a missing case, i.e. where we get an out-of-order FIN by itself.
In such a case we don't want to leave the FIN bit set, otherwise we will do the
wrong thing and ack the FIN incorrectly. Instead we need to go through the
tcp_reasm() code and that way the FIN will be stripped and all will be well.

Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30497


# 4f3addd9 26-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Add a socket option to rack so we can test various changes to the slop value in timers.

Timer_slop, in TCP, has been 200ms for a long time. This value dates back
a long time when delayed ack timers were longer and links were slower. A
200ms timer slop allows 1 MSS to be sent over a 60kbps link. Its possible that
lowering this value to something more in line with todays delayed ack values (40ms)
might improve TCP. This bit of code makes it so rack can, via a socket option,
adjust the timer slop.

Reviewed by: mtuexen
Sponsered by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30249


# 13c0e198 25-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Fix bugs related to the PUSH bit and rack and an ack war

Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed
in the last commit. The problem is the left edge gets transmitted before the adjustments are done
to the send_map, this means that right edge bits must be considered to be added only if
the entire RSM is being retransmitted.

Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns
out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself.
After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically
what happens is we go into the reassembly code and lose the FIN bit. The trick here is we
should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything.
That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no
timers running. This is because the usrclosed function gets called and the FIN's and such have
already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get
stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly
recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE
before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2
timer can speed this up in testing.

Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30451


# 9bbd1a8f 24-May-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix a RACK socket buffer lock issue

Fix a missing socket buffer unlocking of the socket receive buffer.

Reviewed by: gallatin, rrs
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D30402


# 631449d5 24-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Fix an issue with the PUSH bit as well as fill in the missing mtu change for fsb's

The push bit itself was also not actually being properly moved to
the right edge. The FIN bit was incorrectly on the left edge. We
fix these two issues as well as plumb in the mtu_change for
alternate stacks.

Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30413


# 8923ce63 22-May-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: Handle stack switch while processing socket options

Handle the case where during socket option processing, the user
switches a stack such that processing the stack specific socket
option does not make sense anymore. Return an error in this case.

MFC after: 1 week
Reviewed by: markj
Reported by: syzbot+a6e1d91f240ad5d72cd1@syzkaller.appspotmail.com
Sponsored by: Netflix, Inc.
Differential revision: https://reviews.freebsd.org/D30395


# 39756885 21-May-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

rack: honor prior socket buffer lock when doing the upcall

While partially reverting D24237 with D29690, due to introducing some
unintended effects for in-kernel TCP consumers, the preexisting lock
on the socket send buffer was not considered properly.

Found by: markj
MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D30390


# 032bf749 21-May-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

[tcp] Keep socket buffer locked until upcall

r367492 would unlock the socket buffer before eventually calling the upcall.
This leads to problematic interaction with NFS kernel server/client components
(MP threads) accessing the socket buffer with potentially not correctly updated
state.

Reported by: rmacklem
Reviewed By: tuexen, #transport
Tested by: rmacklem, otis
MFC after: 2 weeks
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29690


# 500eb6dd 21-May-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: Fix sending of TCP segments with IP level options

When bringing in TCP over UDP support in
https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605,
the length of IP level options was considered when locating the
transport header. This was incorrect and is fixed by this patch.

X-MFC with: https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605
MFC after: 3 days
Reviewed by: markj, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D30358


# 02cffbc2 13-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Incorrect KASSERT causes a panic in rack

Skyzall found an interesting panic in rack. When a SYN and FIN are
both sent together a KASSERT gets tripped where it is validating that
a mbuf pointer is in the sendmap. But a SYN and FIN often will not
have a mbuf pointer. So the fix is two fold a) make sure that the
SYN and FIN split the right way when cloning an RSM SYN on left
edge and FIN on right. And also make sure the KASSERT properly
accounts for the case that we have a SYN or FIN so we don't
panic.

Reviewed by: mtuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D30241


# 251842c6 12-May-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp rack: improve initialisation of retransmit timeout

When the TCP is in the front states, don't take the slop variable
into account. This improves consistency with the base stack.

Reviewed by: rrs@
Differential Revision: https://reviews.freebsd.org/D30230
MFC after: 1 week
Sponsored by: Netflix, Inc.


# 4b86a24a 11-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: In rack, we must only convert restored rtt when the hostcache does restore them.

Rack now after the previous commit is very careful to translate any
value in the hostcache for srtt/rttvar into its proper format. However
there is a snafu here in that if tp->srtt is 0 is the only time that
the HC will actually restore the srtt. We need to then only convert
the srtt restored when it is actually restored. We do this by making
sure it was zero before the call to cc_conn_init and it is non-zero
afterwards.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30213


# 9867224b 10-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp:Host cache and rack ending up with incorrect values.

The hostcache up to now as been updated in the discard callback
but without checking if we are all done (the race where there are
more than one calls and the counter has not yet reached zero). This
means that when the race occurs, we end up calling the hc_upate
more than once. Also alternate stacks can keep there srtt/rttvar
in different formats (example rack keeps its values in microseconds).
Since we call the hc_update *before* the stack fini() then the
values will be in the wrong format.

Rack on the other hand, needs to convert items pulled from the
hostcache into its internal format else it may end up with
very much incorrect values from the hostcache. In the process
lets commonize the update mechanism for srtt/rttvar since we
now have more than one place that needs to call it.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30172


# a16cee02 07-May-2021 Randall Stewart <rrs@FreeBSD.org>

Fix a UDP tunneling issue with rack. Basically there are two
issues.
A) Not enough hdrlen was being calculated when a UDP tunnel is
in place.
and
B) Not enough memory is allocated in racks fsb. We need to
overbook the fsb to include a udphdr just in case.

Submitted by: Peter Lei
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30157


# 5d8fd932 06-May-2021 Randall Stewart <rrs@FreeBSD.org>

This brings into sync FreeBSD with the netflix versions of rack and bbr.
This fixes several breakages (panics) since the tcp_lro code was
committed that have been reported. Quite a few new features are
now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the
largest). There is also support for ack-war prevention. Documents
comming soon on rack..

Sponsored by: Netflix
Reviewed by: rscheff, mtuexen
Differential Revision: https://reviews.freebsd.org/D30036


# 9e644c23 18-Apr-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add support for TCP over UDP

Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.

Reviewed by: rrs
Sponsored by: Netflix, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29469


# 2e978260 17-Apr-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

rack: Fix ECN on finalizing session.

Maintain code similarity between RACK and base stack
for ECN. This may not strictly be necessary, depending
when a state transition to FIN_WAIT_1 is done in RACK
after a shutdown() or close() syscall.

MFC after: 3 days
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29658


# 40f41ece 22-Mar-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve handling of SYN segments in SYN-SENT state

Ensure that the stack does not generate a DSACK block for user
data received on a SYN segment in SYN-SENT state.

Reviewed by: rscheff
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D29376
Sponsored by: Netflix, Inc.


# 5666643a 13-Mar-2021 Gordon Bergling <gbe@FreeBSD.org>

Fix some common typos in comments

- occured -> occurred
- normaly -> normally
- controling -> controlling
- fileds -> fields
- insterted -> inserted
- outputing -> outputting

MFC after: 1 week


# 705d06b2 05-Mar-2021 Michael Tuexen <tuexen@FreeBSD.org>

rack: unbreak TCP fast open for the client side

Allow sending user data on the SYN segment.

Reviewed by: rrs
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D29082
Sponsored by: Netflix, Inc.


# 99adf230 01-Mar-2021 Michael Tuexen <tuexen@FreeBSD.org>

RACK: fix an issue triggered by using the CDG CC module

Obtained from: rrs@
MFC after: 3 days
PR: 238741
Sponsored by: Netlix, Inc.


# 1a714ff2 26-Jan-2021 Randall Stewart <rrs@FreeBSD.org>

This pulls over all the changes that are in the netflix
tree that fix the ratelimit code. There were several bugs
in tcp_ratelimit itself and we needed further work to support
the multiple tag format coming for the joint TLS and Ratelimit dances.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D28357


# d2b3cedd 13-Jan-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add sysctl to tolerate TCP segments missing timestamps

When timestamp support has been negotiated, TCP segements received
without a timestamp should be discarded. However, there are broken
TCP implementations (for example, stacks used by Omniswitch 63xx and
64xx models), which send TCP segments without timestamps although
they negotiated timestamp support.
This patch adds a sysctl variable which tolerates such TCP segments
and allows to interoperate with broken stacks.

Reviewed by: jtl@, rscheff@
Differential Revision: https://reviews.freebsd.org/D28142
Sponsored by: Netflix, Inc.
PR: 252449
MFC after: 1 week


# cc3c3485 13-Jan-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix handling of TCP RST segments missing timestamps

A TCP RST segment should be processed even it is missing TCP
timestamps.

Reported by: dmgk@, kevans@
Reviewed by: rscheff@, dmgk@
Sponsored by: Netflix, Inc.
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28143


# 283c76c7 09-Nov-2020 Michael Tuexen <tuexen@FreeBSD.org>

RFC 7323 specifies that:
* TCP segments without timestamps should be dropped when support for
the timestamp option has been negotiated.
* TCP segments with timestamps should be processed normally if support
for the timestamp option has not been negotiated.
This patch enforces the above.

PR: 250499
Reviewed by: gnn, rrs
MFC after: 1 week
Sponsored by: Netflix, Inc
Differential Revision: https://reviews.freebsd.org/D27148


# 4d0770f1 08-Nov-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Prevent premature SACK block transmission during loss recovery

Under specific conditions, a window update can be sent with
outdated SACK information. Some clients react to this by
subsequently delaying loss recovery, making TCP perform very
poorly.

Reported by: chengc_netapp.com
Reviewed by: rrs, jtl
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D24237


# 285385ba 09-Sep-2020 Randall Stewart <rrs@FreeBSD.org>

So it turns out that syzkaller hit another crash. It has to do with switching
stacks with a SENT_FIN outstanding. Both rack and bbr will only send a
FIN if all data is ack'd so this must be enforced. Also if the previous stack
sent the FIN we need to make sure in rack that when we manufacture the
"unknown" sends that we include the proper HAS_FIN bits.

Note for BBR we take a simpler approach and just refuse to switch.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D26269


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# 1951fa79 25-Aug-2020 Michael Tuexen <tuexen@FreeBSD.org>

RFC 3465 defines a limit L used in TCP slow start for limiting the number
of acked bytes as described in Section 2.2 of that document.
This patch ensures that this limit is not also applied in congestion
avoidance. Applying this limit also in congestion avoidance can result in
using less bandwidth than allowed.

Reported by: l.tian.email@gmail.com
Reviewed by: rrs, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D26120


# b9978183 19-Aug-2020 Andrew Gallatin <gallatin@FreeBSD.org>

TCP: remove special treatment for hardware (ifnet) TLS

Remove most special treatment for ifnet TLS in the TCP stack, except
for code to avoid mixing handshakes and bulk data.

This code made heroic efforts to send down entire TLS records to
NICs. It was added to improve the PCIe bus efficiency of older TLS
offload NICs which did not keep state per-session, and so would need
to re-DMA the first part(s) of a TLS record if a TLS record was sent
in multiple TCP packets or TSOs. Newer TLS offload NICs do not need
this feature.

At Netflix, we've run extensive QoE tests which show that this feature
reduces client quality metrics, presumably because the effort to send
TLS records atomically causes the server to both wait too long to send
data (leading to buffers running dry), and to send too much data at
once (leading to packet loss).

Reviewed by: hselasky, jhb, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26103


# 95ef69c6 16-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

iSo in doing final checks on OCA firmware with all the latest tweaks the dup-ack checking
packet drill script was failing with a number of unexpected acks. So it turns
out if you have the default recvwin set up to 1Meg (like OCA's do) and you
have no window scaling (like the dupack checking code) then we have another
case where we are always trying to update the rwnd and sending an
ack when we should not.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D25298


# 4d418f8d 15-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

So it turns out rack has a shortcoming in dup-ack counting. It counts the dupacks but
then does not properly respond to them. This is because a few missing bits are not present.
BBR actually does properly respond (though it also sends a TLP which is interesting and
maybe something to fix)..

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D25294


# f092a3c7 12-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

So it turns out with the right window scaling you can get the code in all stacks to
always want to do a window update, even when no data can be sent. Now in
cases where you are not pacing thats probably ok, you just send an extra
window update or two. However with bbr (and rack if its paced) every time
the pacer goes off its going to send a "window update".

Also in testing bbr I have found that if we are not responding to
data right away we end up staying in startup but incorrectly holding
a pacing gain of 192 (a loss). This is because the idle window code
does not restict itself to only work with PROBE_BW. In all other
states you dont want it doing a PROBE_BW state change.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D25247


# e854dd38 08-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

An important statistic in determining if a server process (or client) is being delayed
is to know the time to first byte in and time to first byte out. Currently we
have no way to know these all we have is t_starttime. That (t_starttime) tells us
what time the 3 way handshake completed. We don't know when the first
request came in or how quickly we responded. Nor from a client perspective
do we know how long from when we sent out the first byte before the
server responded.

This small change adds the ability to track the TTFB's. This will show up in
BB logging which then can be pulled for later analysis. Note that currently
the tracking is via the ticks variable of all three variables. This provides
a very rough estimate (hz=1000 its 1ms). A follow-on set of work will be
to change all three of these values into something with a much finer resolution
(either microseconds or nanoseconds), though we may want to make the resolution
configurable so that on lower powered machines we could still use the much
cheaper ticks variable.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D24902


# f1ea4e41 03-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

This fixes a couple of skyzaller crashes. Most
of them have to do with TFO. Even the default stack
had one of the issues:

1) We need to make sure for rack that we don't advance
snd_nxt beyond iss when we are not doing fast open. We
otherwise can get a bunch of SYN's sent out incorrectly
with the seq number advancing.
2) When we complete the 3-way handshake we should not ever
append to reassembly if the tlen is 0, if TFO is enabled
prior to this fix we could still call the reasemmbly. Note
this effects all three stacks.
3) Rack like its cousin BBR should track if a SYN is on a
send map entry.
4) Both bbr and rack need to only consider len incremented on a SYN
if the starting seq is iss, otherwise we don't increment len which
may mean we return without adding a sendmap entry.

This work was done in collaberation with Michael Tuexen, thanks for
all the testing!
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D25000


# af2fb894 21-May-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

With RFC3168 ECN, CWR SHOULD only be sent with new data

Overly conservative data receivers may ignore the CWR flag
on other packets, and keep ECE latched. This can result in
continous reduction of the congestion window, and very poor
performance when ECN is enabled.

Reviewed by: rgrimes (mentor), rrs
Approved by: rgrimes (mentor), tuexen (mentor)
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D23364


# 8e051165 21-May-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Retain only mutually supported TCP options after simultaneous SYN

When receiving a parallel SYN in SYN-SENT state, remove all the
options only we supported locally before sending the SYN,ACK.

This addresses a consistency issue on parallel opens.

Also, on such a parallel open, the stack could be coaxed into
running with timestamps enabled, even if administratively disabled.

Reviewed by: tuexen (mentor)
Approved by: tuexen (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D23371


# 6e16d877 21-May-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Handle ECN handshake in simultaneous open

While testing simultaneous open TCP with ECN, found that
negotiation fails to arrive at the expected final state.

Reviewed by: tuexen (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D23373


# 777b88d6 15-May-2020 Randall Stewart <rrs@FreeBSD.org>

This fixes several skyzaller issues found with the
help of Michael Tuexen. There was some accounting
errors with TCPFO for bbr and also for both rack
and bbr there was a FO case where we should be
jumping to the just_return_nolock label to
exit instead of returning 0. This of course
caused no timer to be running and thus the
stuck sessions.

Reported by: Michael Tuexen and Skyzaller
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D24852


# b1ddcbc6 07-May-2020 Randall Stewart <rrs@FreeBSD.org>

When in the SYN-SENT state bbr and rack will not properly send an ACK but instead start the D-ACK timer. This
causes so_reuseport_lb_test to fail since it slows down how quickly the program runs until the timeout occurs
and fails the test

Sponsored by: Netflix inc.
Differential Revision: https://reviews.freebsd.org/D24747


# 8717b8f1 07-May-2020 Randall Stewart <rrs@FreeBSD.org>

NF has an internal option that changes the tcp_mcopy_m routine slightly (has
a few extra arguments). Recently that changed to only have one arg extra so
that two ifdefs around the call are no longer needed. Lets take out the
extra ifdef and arg.

Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D24736


# 51a53922 04-May-2020 Michael Tuexen <tuexen@FreeBSD.org>

Add net epoch support back, which was taken out by accident in
https://svnweb.freebsd.org/changeset/base/360639

Reviewed by: rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24694


# 963fb2ad 04-May-2020 Randall Stewart <rrs@FreeBSD.org>

This commit brings things into sync with the advancements that
have been made in rack and adds a few fixes in BBR. This also
removes any possibility of incorrectly doing OOB data the stacks
do not support it. Should fix the skyzaller crashes seen in the
past. Still to fix is the BBR issue just reported this weekend
with the SYN and on sending a RST. Note that this version of
rack can now do pacing as well.

Sponsored by:Netflix Inc
Differential Revision:https://reviews.freebsd.org/D24576


# 9028b6e0 29-Apr-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Prevent premature shrinking of the scaled receive window
which can cause a TCP client to use invalid or stale TCP sequence numbers for ACK packets.

Packets with old sequence numbers are ignored and not used to update the send window size.
This might cause the TCP session to hang indefinitely under some circumstances.

Reported by: Cui Cheng
Reviewed by: tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D24515


# b2ade6b1 29-Apr-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Correctly set up the initial TCP congestion window in all cases,
by not including the SYN bit sequence space in cwnd related calculations.
Snd_und is adjusted explicitly in all cases, outside the cwnd update, instead.

This fixes an off-by-one conformance issue with regular TCP sessions not
using Appropriate Byte Counting (RFC3465), sending one more packet during
the initial window than expected.

PR: 235256
Reviewed by: tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D19000


# 454d3896 25-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix LINT build #2 after r360292.

Pointyhat to: melifaro


# bb410f9f 21-Apr-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

revert rS360143 - Correctly set up initial cwnd
due to syzkaller panics found

Reported by: tuexen
Approved by: tuexen (mentor)
Sponsored by: NetApp, Inc.


# 73b76966 21-Apr-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Correctly set up the initial TCP congestion window
in all cases, by adjust snd_una right after the
connection initialization, to include the one byte
in sequence space occupied by the SYN bit.

This does not change the regular ACK processing,
while making the BYTES_THIS_ACK macro to work properly.

PR: 235256
Reviewed by: tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D19000


# 413c3db1 31-Mar-2020 Michael Tuexen <tuexen@FreeBSD.org>

Allow the TCP backhole detection to be disabled at all, enabled only
for IPv4, enabled only for IPv6, and enabled for IPv4 and IPv6.
The current blackhole detection might classify a temporary outage as
an MTU issue and reduces permanently the MSS. Since the consequences of
such a reduction due to a misclassification are much more drastically
for IPv4 than for IPv6, allow the administrator to enable it for IPv6 only.

Reviewed by: bcr@ (man page), Richard Scheffenegger
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24219


# 7ca6e296 12-Mar-2020 Michael Tuexen <tuexen@FreeBSD.org>

Use KMOD_TCPSTAT_INC instead of TCPSTAT_INC for RACK and BBR, since
these are kernel modules. Also add a KMOD_TCPSTAT_ADD and use that
instead of TCPSTAT_ADD.

Reviewed by: jtl@, rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D23904


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 3fba40d9 11-Feb-2020 Randall Stewart <rrs@FreeBSD.org>

Remove all trailing white space from the BBR/Rack fold. Bits
left around by emacs (thanks emacs).


# d2517ab0 11-Feb-2020 Randall Stewart <rrs@FreeBSD.org>

Now that all of the stats framework is
in FreeBSD the bits that disabled stats
when netflix-stats is not defined is no longer
needed. Lets remove these bits so that we
will properly use stats per its definition
in BBR and Rack.

Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D23088


# 9cc711c9 25-Jan-2020 Michael Tuexen <tuexen@FreeBSD.org>

Sending CWR after an RTO is according to RFC 3168 generally required
and not only for the DCTCP congestion control.

Submitted by: Richard Scheffenegger
Reviewed by: rgrimes, tuexen@, Cheng Cui
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23119


# 47e2c17c 25-Jan-2020 Michael Tuexen <tuexen@FreeBSD.org>

Don't set the ECT codepoint on retransmitted packets during SACK loss
recovery. This is required by RFC 3168.

Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, tuexen@, Cheng Cui
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23118


# a2d59694 25-Jan-2020 Michael Tuexen <tuexen@FreeBSD.org>

As a TCP client only enable ECN when the corresponding sysctl variable
indicates that ECN should be negotiated for the client side.

Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, tuexen@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23228


# ee97681e 24-Jan-2020 Michael Tuexen <tuexen@FreeBSD.org>

Don't delay the ACK for a TCP segment with the CWR flag set.
This allows the data sender to increase the CWND faster.

Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, tuexen@, Cheng Cui
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22670


# e1d2b469 24-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Enter the network epoch when rack_output() is called in setsockopt(2).


# 109eb549 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make tcp_output() require network epoch.

Enter the epoch before calling into tcp_output() from those
functions, that didn't do that before.

This eliminates a bunch of epoch recursions in TCP.


# b9555453 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make ip6_output() and ip_output() require network epoch.

All callers that before may called into these functions
without network epoch now must enter it.


# 334fc582 08-Jan-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

vnet: virtualise more network stack sysctls.

Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are
set in the netoptions startup script, which we would love to run for VNETs
as well [1].

While virtualising the log_in_vain sysctls seems pointles at first for as
long as the kernel message buffer is not virtualised, it at least allows
an administrator to debug the base system or an individual jail if needed
without turning the logging on for all jails running on a system.

PR: 243193 [1]
MFC after: 2 weeks


# 4ad24737 06-Jan-2020 Randall Stewart <rrs@FreeBSD.org>

This catches rack up in the recent changes to ECN and
also commonizes the functions that both the freebsd and
rack stack uses.

Sponsored by:Netflix Inc
Differential Revision: https://reviews.freebsd.org/D23052


# 1cf55767 17-Dec-2019 Randall Stewart <rrs@FreeBSD.org>

This commit is a bit of a re-arrange of deck chairs. It
gets both rack and bbr ready for the completion of the STATs
framework in FreeBSD. For now if you don't have both NF_stats and
stats on it disables them. As soon as the rest of the stats framework
lands we can remove that restriction and then just uses stats when
defined.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D22479


# 3cf38784 01-Dec-2019 Michael Tuexen <tuexen@FreeBSD.org>

Move all ECN related flags from the flags to the flags2 field.
This allows adding more ECN related flags in the future.
No functional change intended.

Submitted by: Richard Scheffenegger
Reviewed by: rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22497


# b72e56e7 01-Dec-2019 Michael Tuexen <tuexen@FreeBSD.org>

This is an initial step in implementing the new congestion window
validation as specified in RFC 7661.

Submitted by: Richard Scheffenegger
Reviewed by: rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D21798


# 8df12ffc 01-Dec-2019 Michael Tuexen <tuexen@FreeBSD.org>

Make the IPTOS value available to all substate handlers. This will allow
to add support for L4S or SCE, which require processing of the IP TOS
field.

Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22426


# d40c0d47 07-Nov-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Now that all of the tcp_input() and all its branches are executed
in the network epoch, we can greatly simplify synchronization.
Remove all unneccesary epoch enters hidden under INP_INFO_RLOCK macro.
Remove some unneccesary assertions and convert necessary ones into the
NET_EPOCH_ASSERT macro.


# 8ee1cf03 14-Oct-2019 Randall Stewart <rrs@FreeBSD.org>

if_hw_tsomaxsegsize needs to be initialized to zero, just
like in bbr.c and tcp_output.c


# 12a43d0d 29-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

RFC 7112 requires a host to put the complete IP header chain
including the TCP header in the first IP packet.
Enforce this in tcp_output(). In addition make sure that at least
one byte payload fits in the TCP segement to allow making progress.
Without this check, a kernel with INVARIANTS will panic.
This issue was found by running an instance of syzkaller.

Reviewed by: jtl@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21665


# 79c2a2a0 28-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

Ensure that the INP lock is released before leaving [gs]etsockopt()
for RACK specific socket options.
These issues were found by a syzkaller instance.
Reviewed by: rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21825


# ac7bd23a 24-Sep-2019 Randall Stewart <rrs@FreeBSD.org>

lets put (void) in a couple of functions to keep older platforms that
are stuck with gcc happy (ppc). The changes are needed in both bbr and
rack.

Obtained from: Michael Tuexen (mtuexen@)


# 35c7bb34 24-Sep-2019 Randall Stewart <rrs@FreeBSD.org>

This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This
is a completely separate TCP stack (tcp_bbr.ko) that will be built only if
you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option
TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that
supports hardware pacing, BBR understands how to use such a feature.

Note that this commit also adds in a general purpose time-filter which
allows you to have a min-filter or max-filter. A filter allows you to
have a low (or high) value for some period of time and degrade slowly
to another value has time passes. You can find out the details of
BBR by looking at the original paper at:

https://queue.acm.org/detail.cfm?id=3022184

or consult many other web resources you can find on the web
referenced by "BBR congestion control". It should be noted that
BBRv1 (which this is) does tend to unfairness in cases of small
buffered paths, and it will usually get less bandwidth in the case
of large BDP paths(when competing with new-reno or cubic flows). BBR
is still an active research area and we do plan on implementing V2
of BBR to see if it is an improvement over V1.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D21582


# dd3121a8 19-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

When the RACK stack computes the space for user data in a TCP segment,
it wasn't taking the IP level options into account. This patch fixes this.
In addition, it also corrects a KASSERT and adds protection code to assure
that the IP header chain and the TCP head fit in the first fragment as
required by RFC 7112.

Reviewed by: rrs@
MFC after: 3 days
Sponsored by: Nertflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21666


# 6d261981 09-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

Only update SACK/DSACK lists when a non-empty segment was received.
This fixes hitting a KASSERT with a valid packet exchange.

Reviewed by: rrs@, Richard Scheffenegger
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21567


# 191ae5bf 03-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

Fix two TCP RACK issues:
* Convert the TCP delayed ACK timer from ms to ticks as required.
This fixes the timer on platforms with hz != 1000.
* Don't delay acknowledgements which report duplicate data using
DSACKs.

Reviewed by: rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21512


# fe5dee73 02-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

This patch improves the DSACK handling to conform with RFC 2883.
The lowest SACK block is used when multiple Blocks would be elegible as
DSACK blocks ACK blocks get reordered - while maintaining the ordering of
SACK blocks not relevant in the DSACK context is maintained.

Reviewed by: rrs@, tuexen@
Obtained from: Richard Scheffenegger
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21038


# b2e60773 26-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277


# e13ad86c 21-Aug-2019 Randall Stewart <rrs@FreeBSD.org>

Fix an issue when TSO and Rack play together. Basically
an retransmission of the initial SYN (with data) would
cause us to strip the SYN and decrement/increase offset/len
which then caused us a -1 offset and a panic.

Reported by: Larry Rosenman
(Michael Tuexen helped me debug this at the IETF)


# 23fa2dbc 12-Aug-2019 Randall Stewart <rrs@FreeBSD.org>

Place back in the dependency on HPTS via module depends versus
a fatal error in compiling. This was taken out by mistake
when I mis-merged from the 18q22p2 sources of rack in NF. Opps.

Reported by: sbruno


# e5926fd3 14-Jul-2019 Randall Stewart <rrs@FreeBSD.org>

This is the second in a number of patches needed to
get BBRv1 into the tree. This fixes the DSACK bug but
is also needed by BBR. We have yet to go two more
one will be for the pacing code (tcp_ratelimit.c) and
the second will be for the new updated LRO code that
allows a transport to know the arrival times of packets
and (tcp_lro.c). After that we should finally be able
to get BBRv1 into head.

Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D20908


# 55f79588 12-Jul-2019 Randall Stewart <rrs@FreeBSD.org>

add back the comment around the pending DSACK fixes.


# 1cf999a5 10-Jul-2019 Randall Stewart <rrs@FreeBSD.org>

Update to jhb's other suggestion, use #error when
we are missing HPTS.


# 9cf3c235 10-Jul-2019 Randall Stewart <rrs@FreeBSD.org>

Update copyright per JBH's suggestions.. thanks.


# 3b0b41e6 10-Jul-2019 Randall Stewart <rrs@FreeBSD.org>

This commit updates rack to what is basically being used at NF as
well as sets in some of the groundwork for committing BBR. The
hpts system is updated as well as some other needed utilities
for the entrance of BBR. This is actually part 1 of 3 more
needed commits which will finally complete with BBRv1 being
added as a new tcp stack.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D20834


# 5e02b277 19-Jun-2019 Jonathan T. Looney <jtl@FreeBSD.org>

Add the ability to limit how much the code will fragment the RACK send map
in response to SACKs. The default behavior is unchanged; however, the limit
can be activated by changing the new net.inet.tcp.rack.split_limit sysctl.

Submitted by: Peter Lei <peterlei@netflix.com>
Reported by: jtl
Reviewed by: lstewart (earlier version)
Security: CVE-2019-5599


# d1156b05 06-Jun-2019 Michael Tuexen <tuexen@FreeBSD.org>

r347382 added receiver side DSACK support for the TCP base stack.
The corresponding changes for the RACK stack where missed and are added
by this commit.

Reviewed by: Richard Scheffenegger, rrs@
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D20372


# c6dcb64b 20-Feb-2019 Michael Tuexen <tuexen@FreeBSD.org>

Use exponential backoff for retransmitting SYN segments as specified
in the TCP RFCs.

Reviewed by: rrs@, Richard Scheffenegger
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D18974


# 52467047 04-Feb-2019 Warner Losh <imp@FreeBSD.org>

Regularize the Netflix copyright

Use recent best practices for Copyright form at the top of
the license:
1. Remove all the All Rights Reserved clauses on our stuff. Where we
piggybacked others, use a separate line to make things clear.
2. Use "Netflix, Inc." everywhere.
3. Use a single line for the copyright for grep friendliness.
4. Use date ranges in all places for our stuff.

Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)


# 116ef4d6 31-Jan-2019 Michael Tuexen <tuexen@FreeBSD.org>

When handling SYN-ACK segments in the SYN-RCVD state, set tp->snd_wnd
consistently.

This inconsistency was observed when working on the bug reported in
PR 235256, although it does not fix the reported issue. The fix for
the PR will be a separate commit.

PR: 235256
Reviewed by: rrs@, Richard Scheffenegger
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D19033


# bf7fcdb1 27-Jan-2019 Michael Tuexen <tuexen@FreeBSD.org>

Fix the detection of ECN-setup SYN-ACK packets.

RFC 3168 defines an ECN-setup SYN-ACK packet as on with the ECE flags
set and the CWR flags not set. The code was only checking if ECE flag
is set. This patch adds the check to verify that the CWR flags is not
set.

Submitted by: Richard Scheffenegger
Reviewed by: tuexen@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18996


# 7dc90a1d 25-Jan-2019 Michael Tuexen <tuexen@FreeBSD.org>

Fix a bug in the restart window computation of TCP New Reno

When implementing support for IW10, an update in the computation
of the restart window used after an idle phase was missed. To
minimize code duplication, implement the logic in tcp_compute_initwnd()
and call it. This fixes a bug in NewReno, which was not aware of
IW10.

Submitted by: Richard Scheffenegger
Reviewed by: tuexen@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18940


# ad2be389 22-Nov-2018 Michael Tuexen <tuexen@FreeBSD.org>

A TCP stack is required to check SEG.ACK first, when processing a
segment in the SYN-SENT state as stated in Section 3.9 of RFC 793,
page 66. Ensure this is also done by the TCP RACK stack.

Reviewed by: rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D18034


# fef56019 22-Nov-2018 Michael Tuexen <tuexen@FreeBSD.org>

Ensure that the TCP RACK stack honours the setting of the
net.inet.tcp.drop_synfin sysctl-variable.

Reviewed by: rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D18033


# 7e729f07 22-Nov-2018 Michael Tuexen <tuexen@FreeBSD.org>

Ensure that the default RTT stack can make an RTT measurement if
the TCP connection was initiated using the RACK stack, but the
peer does not support the TCP RACK extension.

This ensures that the TCP behaviour on the wire is the same if
the TCP connection is initated using the RACK stack or the default
stack.

Reviewed by: rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D18032


# 79410718 22-Nov-2018 Michael Tuexen <tuexen@FreeBSD.org>

Ensure that TCP RST-segments announce consistently a receiver window of
zero. This was already done when sending them via tcp_respond().

Reviewed by: rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D17949


# 3bea9a26 21-Nov-2018 Michael Tuexen <tuexen@FreeBSD.org>

Improve two KASSERTs in the TCP RACK stack.

There are two locations where an always true comparison was made in
a KASSERT. Replace this by an appropriate check and use a consistent
panic message. Also use this code when checking a similar condition.

PR: 229664
Reviewed by: rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D18021


# a8a8a8a8 12-Sep-2018 Michael Tuexen <tuexen@FreeBSD.org>

Fix TCP Fast Open for the TCP RACK stack.

* Fix a bug where the SYN handling during established state was
applied to a front state.
* Move a check for retransmission after the timer handling.
This was suppressing timer based retransmissions.
* Fix an off-by one byte in the sequence number of retransmissions.
* Apply fixes corresponding to
https://svnweb.freebsd.org/changeset/base/336934

Reviewed by: rrs@
Approved by: re (kib@)
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16912


# c28440db 19-Aug-2018 Randall Stewart <rrs@FreeBSD.org>

This change represents a substantial restructure of the way we
reassembly inbound tcp segments. The old algorithm just blindly
dropped in segments without coalescing. This meant that every
segment could take up greater and greater room on the linked list
of segments. This of course is now subject to a tighter limit (100)
of segments which in a high BDP situation will cause us to be a
lot more in-efficent as we drop segments beyond 100 entries that
we receive. What this restructure does is cause the reassembly
buffer to coalesce segments putting an emphasis on the two
common cases (which avoid walking the list of segments) i.e.
where we add to the back of the queue of segments and where we
add to the front. We also have the reassembly buffer supporting
a couple of debug options (black box logging as well as counters
for code coverage). These are compiled out by default but can
be added by uncommenting the defines.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D16626


# d18ea344 08-Aug-2018 Randall Stewart <rrs@FreeBSD.org>

Fix a small bug in rack where it will
end up sending the FIN twice.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D16604


# 936b2b64 06-Aug-2018 Randall Stewart <rrs@FreeBSD.org>

This fixes a bug in Rack where we were
not properly using the correct value for
Delayed Ack.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D16579


# 4ad5b7a0 30-Jul-2018 Randall Stewart <rrs@FreeBSD.org>

This fixes a hole where rack could end up
sending an invalid segment into the reassembly
queue. This would happen if you enabled the
data after close option.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16453


# 8de9ac5e 18-Jul-2018 Randall Stewart <rrs@FreeBSD.org>

Bump the ICMP echo limits to match the RFC

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D16333


# a026a53a 10-Jul-2018 Michael Tuexen <tuexen@FreeBSD.org>

Use appropriate MSS value when populating the TCP FO client cookie cache

When a client receives a SYN-ACK segment with a TFP fast open cookie,
but without an MSS option, an MSS value from uninitialised stack memory is used.
This patch ensures that in case no MSS option is included in the SYN-ACK,
the appropriate value as given in RFC 7413 is used.

Reviewed by: kbowling@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16175


# 5f1347d7 06-Jul-2018 Michael Tuexen <tuexen@FreeBSD.org>

Allow alternate TCP stack to populate the TCP FO client cookie
cache.

Without this patch, TCP FO could be used when using alternate
TCP stack, but only existing entires in the TCP client cookie
cache could be used. This cache was not populated by connections
using alternate TCP stacks.

Sponsored by: Netflix, Inc.


# 6573d758 03-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

epoch(9): allow preemptible epochs to compose

- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066


# a00f4ac2 23-Jun-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Revert r334843, and partially revert r335180.

tcp_outflags[] were defined since 4BSD and are defined nowadays in
all its descendants. Removing them breaks third party application.


# c6f76759 19-Jun-2018 Randall Stewart <rrs@FreeBSD.org>

Make sure that the t_peakrate_thr is not compiled in
by default until NF can upstream it.

Reviewed by: and suggested lstewart
Sponsored by: Netflix Inc.


# 9e58ff6f 18-Jun-2018 Matt Macy <mmacy@FreeBSD.org>

convert inpcbinfo hash and info rwlocks to epoch + mutex

- Convert inpcbinfo info & hash locks to epoch for read and mutex for write
- Garbage collect code that handled INP_INFO_TRY_RLOCK failures as
INP_INFO_RLOCK which can no longer fail

When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces
unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to
3%.

Overall packet throughput rate limited by CPU affinity and NIC driver design
choices.

On the receiver unhalted core cycles samples in in_pcblookup_hash went from
13% to to 1.6%

Tested by LLNW and pho@

Reviewed by: jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15686


# 9293873e 14-Jun-2018 Gleb Smirnoff <glebius@FreeBSD.org>

TCPOUTFLAGS no longer exists since r334843.


# 4aec110f 13-Jun-2018 Randall Stewart <rrs@FreeBSD.org>

This fixes several bugs that Larry Rosenman helped me find in
Rack with respect to its handling of TCP Fast Open. Several
fixes all related to TFO are included in this commit:
1) Handling of non-TFO retransmissions
2) Building the proper send-map when we are doing TFO
3) Dealing with the ack that comes back that includes the
SYN and data.

It appears that with this commit TFO now works :-)

Thanks Larry for all your help!!

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15758


# cff21e48 11-Jun-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Change RACK dependency on TCPHPTS from a build-time dependency to a load-
time dependency.

At present, RACK requires the TCPHPTS option to run. However, because
modules can be moved from machine to machine, this dependency is really
best assessed at load time rather than at build time.

Reviewed by: rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15756


# 89e560f4 07-Jun-2018 Randall Stewart <rrs@FreeBSD.org>

This commit brings in a new refactored TCP stack called Rack.
Rack includes the following features:
- A different SACK processing scheme (the old sack structures are not used).
- RACK (Recent acknowledgment) where counting dup-acks is no longer done
instead time is used to knwo when to retransmit. (see the I-D)
- TLP (Tail Loss Probe) where we will probe for tail-losses to attempt
to try not to take a retransmit time-out. (see the I-D)
- Burst mitigation using TCPHTPS
- PRR (partial rate reduction) see the RFC.

Once built into your kernel, you can select this stack by either
socket option with the name of the stack is "rack" or by setting
the global sysctl so the default is rack.

Note that any connection that does not support SACK will be kicked
back to the "default" base FreeBSD stack (currently known as "default").

To build this into your kernel you will need to enable in your
kernel:
makeoptions WITH_EXTRA_TCP_STACKS=1
options TCPHPTS

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15525