History log of /freebsd-current/sys/netinet/tcp_stacks/bbr.c
Revision Date Author Comments
# fce03f85 05-May-2024 Randall Stewart <rrs@FreeBSD.org>

TCP can be subject to Sack Attacks lets fix this issue.

There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like:

ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704)
This first sack looks fine but then the attacker sends

ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705)
ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707)
...
These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries.

To combat this we introduce three things.

when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing.
Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks.
The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway.
The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks).

After this set of changes is in rack can drop its SAD detection completely

Reviewed by:tuexen@, rscheff@
Differential Revision: <https://reviews.freebsd.org/D44903>


# c9cd686b 18-Apr-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: drop data received after a FIN has been processed

RFC 9293 describes the handling of data in the CLOSE-WAIT, CLOSING,
LAST-ACK, and TIME-WAIT states:
This should not occur since a FIN has been received from the remote
side. Ignore the segment text.
Therefore, implement this handling.

Reviewed by: rrs, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44746


# 605a0066 15-Apr-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp bbr: improve code consistency

Improve code consistency with the RACK stack.
Reviewed by: gallatin, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44800


# dd7b86e2 18-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove IS_FASTOPEN() macro

The macro is more obfuscating than helping as it just checks a single flag
of t_flags. All other t_flags bits are checked without a macro.

A bigger problem was that declaration of the macro in tcp_var.h depended
on a kernel option. It is a bad practice to create such definitions in
installable headers.

Reviewed by: rscheff, tuexen, kib
Differential Revision: https://reviews.freebsd.org/D44362


# e18b97bd 12-Mar-2024 Randall Stewart <rrs@FreeBSD.org>

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

With a bit of a struggle with git I finally got rack_pcm.c into place (apologies
for not noticing this error). The LINT kernel is running on my box now .. sigh.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986


# c112243f 11-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

Revert "Update to bring the rack stack with all its fixes in."

This commit was incomplete and breaks LINT kernels. The tree has been
broken for 8+ hours.

This reverts commit f6d489f402c320f1a6eaa473491a0b8c3878113e.


# f6d489f4 11-Mar-2024 Randall Stewart <rrs@FreeBSD.org>

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

Note there is a new file that I can't figure out how to get in rack_pcm.c

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986


# 2f4e46df 15-Feb-2024 Michael Tuexen <tuexen@FreeBSD.org>

RACK, BBR: handle EACCES like EPERM for IP output handling

The FreeBSD TCP base stack handles them also the same way.
In case of packet filters dropping packets in the output path,
this avoids retranmitting the dropped packet every 10ms or so.

Reviewed by: rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D43773


# 7b0b448b 27-Dec-2023 Gordon Bergling <gbe@FreeBSD.org>

tcp_stacks: Fix two typos in a source code comments

- s/recieved/received/

MFC after: 3 days


# d2ef52ef 04-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp/hpts: make stacks responsible for clearing themselves out HPTS

There already is the tfb_tcp_timer_stop_all method that is supposed to stop
all time events associated with a given tcpcb by given stack. Some time
ago it was doing actual callout_stop(). Today bbr/rack just mark their
internal state as inactive in their tfb_tcp_timer_stop_all methods, but
tcpcb stays in HPTS wheel and potentially called in from HPTS. Change the
methods to also call tcp_hpts_remove(). Note: I'm not sure if internal
flag is still relevant once we are out of HPTS wheel.

Call the method when connection goes into TCP_CLOSED state, instead of
calling it later when tcpcb is freed. Also call it when we switch between
stacks.

Reviewed by: tuexen, rrs
Differential Revision: https://reviews.freebsd.org/D42857


# 2b3a7746 04-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

hpts: make stacks responsible for tcp_hpts_init()

Those stacks that use HPTS should care about init, not generic code.

Reviewed by: imp, tuexen, rrs
Differential Revision: https://reviews.freebsd.org/D42856


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# ab65c64b 26-Jul-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix handling of <RST,ACK> segments in SYN-RCVD for RACK and BBR

This deals with TCP endpoints in the SYN-RCVD state coming from the
SYN-SENT state.

Reviewed by: rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D41203


# 02b885b0 21-Jun-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix TCP MD5 computation for the BBR and RACK stack

PR: 253096
Reviewed by: cc, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D40597


# 7ea8d027 20-Jun-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

Update various sys/netinet source files to conform with the style(9)
guide on how to label FALLTHOUGH in switch statements.

No functional chance.

Reviewed By: tuexen, cc, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D40622


# 43b117f8 06-Jun-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: make the maximum number of retransmissions tunable per VNET

Both Windows (TcpMaxDataRetransmissions) and Linux (tcp_retries2)
allow to restrict the maximum number of consecutive timer based
retransmissions. Add that same capability on a per-VNet basis to
FreeBSD.

Reviewed By: cc, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D40424


# c3c20de3 25-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: move HPTS/LRO flags out of inpcb to tcpcb

These flags are TCP specific. While here, make also several LRO
internal functions to pass tcpcb pointer instead of inpcb one.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39698


# c2a69e84 25-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: move HPTS related fields from inpcb to tcpcb

This makes inpcb lighter and allows future cache line optimizations
of tcpcb. The reason why HPTS originally used inpcb is the compressed
TIME-WAIT state (see 0d7445193ab), that used to free a tcpcb, while the
associated connection is still on the HPTS ring.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39697


# 2ad584c5 17-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: Inconsistent use of hpts_calling flag

Gleb has noticed there were some inconsistency's in the way the inp_hpts_calls flag was being used. One
such inconsistency results in a bug when we can't allocate enough sendmap entries to entertain a call to
rack_output().. basically a timer won't get started like it should. Also in cleaning this up I find that the
"no_output" side of input needs to be adjusted to make sure we don't try to re-pace too quickly outside
the hpts assurance of 250useconds.

Another thing here is we end up with duplicate calls to tcp_output() which we should not. If packets go
from hpts for processing the input side of tcp will call the output side of tcp on the last packet if it is needed.
This means that when that occurs a second call to tcp_output would be made that is not needed and if pacing
is going on may be harmful.

Lets fix all this and explicitly state the contract that hpts is making with transports that care about the
flag.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39653


# a540cdca 17-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: use queue(9) STAILQ for the input queue

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39574


# 3cc7b667 14-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: stack unloading crash in rack and bbr

Its possible to induce a crash in either rack or bbr. This would be done
if the rack stack were say the default and bbr was being used by a connection.
If the bbr stack is then unloaded and it was active, we will trigger a MPASS assert
in tcp_hpts since the new stack (default rack) would start a timer, and the old stack
(bbr) would have the inp already in hpts.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39576


# 66fbc19f 07-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: pass tcpcb in the tfb_tcp_ctloutput() method instead of inpcb

Just matches rest of the KPI.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39435


# 35bc0bcc 07-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: reduce argument list to functions that pass a segment

The socket argument is superfluous, as a tcpcb always has one and
only one socket.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39434


# 945f9a7c 07-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: misc cleanup of options for rack as well as socket option logging.

Both BBR and Rack have the ability to log socket options, which is currently disabled. Rack
has an experimental SaD (Sack Attack Detection) algorithm that should be made available. Also
there is a t_maxpeak_rate that needs to be removed (its un-used).

Reviewed by: tuexen, cc
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39427


# 73ee5756 31-Mar-2023 Randall Stewart <rrs@FreeBSD.org>

Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features.

So stack switching as always been a bit of a issue. We currently use a break before make setup which means that
if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so
that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from
it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).

Once all this lands I will then update rack to begin using all these new features.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210


# 69c7c811 16-Mar-2023 Randall Stewart <rrs@FreeBSD.org>

Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities.

The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints
we need to move to using the new inline functions. This adds them and moves rack to now use
the tcp_tracepoints.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D38831


# c0e4090e 08-Feb-2023 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: Accurately track if ifnet ktls is enabled

This allows us to avoid spurious calls to ktls_disable_ifnet()

When we implemented ifnet kTLSe, we set a flag in the tx socket
buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that
now, or in the past, ifnet ktls was active on a socket. Later,
I added code to switch ifnet ktls sessions to software in the case
of lossy TCP connections that have a high retransmit rate.
Because TCP was using SB_TLS_IFNET to know if it needed to do math
to calculate the retransmit ratio and potentially call into
ktls_disable_ifnet(), it was doing unneeded work long after
a session was moved to software.

This patch carefully tracks whether or not ifnet ktls is still enabled
on a TCP connection. Because the inp is now embedded in the tcpcb, and
because TCP is the most frequent accessor of this state, it made sense to
move this from the socket buffer flags to the tcpcb. Because we now need
reliable access to the tcbcb, we take a ref on the inp when creating a tx
ktls session.

While here, I noticed that rack/bbr were incorrectly implementing
tfb_hwtls_change(), and applying the change to all pending sends,
when it should apply only to future sends.

This change reduces spurious calls to ktls_disable_ifnet() by 95% or so
in a Netflix CDN environment.

Reviewed by: markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D38380


# 18b83b62 26-Jan-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: reduce the size of t_rttupdated in tcpcb

During tcp session start, various mechanisms need to
track a few initial RTTs before becoming active.
Prevent overflows of the corresponding tracking counter
and reduce the size of tcpcb simultaneously.

Reviewed By: #transport, tuexen, guest-ccui
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D21117


# 73e994a9 19-Jan-2023 Gordon Bergling <gbe@FreeBSD.org>

extra_tcp_stacks: Fix a common typo in source code comments

- s/orginal/original/

MFC after: 3 days


# eaabc937 14-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: retire TCPDEBUG

This subsystem is superseded by modern debugging facilities,
e.g. DTrace probes and TCP black box logging.

We intentionally leave SO_DEBUG in place, as many utilities may
set it on a socket. Also the tcp::debug DTrace probes look at
this flag on a socket.

Reviewed by: gnn, tuexen
Discussed with: rscheff, rrs, jtl
Differential revision: https://reviews.freebsd.org/D37694


# 446ccdd0 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: use single locked callout per tcpcb for the TCP timers

Use only one callout structure per tcpcb that is responsible for handling
all five TCP timeouts. Use locked version of callout, of course. The
callout function tcp_timer_enter() chooses soonest timer and executes it
with lock held. Unless the timer reports that the tcpcb has been freed,
the callout is rescheduled for next soonest timer, if there is any.

With single callout per tcpcb on connection teardown we should be able
to fully stop the callout and immediately free it, avoiding use of
callout_async_drain(). There is one gotcha here: callout_stop() can
actually touch our memory when a rare race condition happens. See
comment above tcp_timer_stop(). Synchronous stop of the callout makes
tcp_discardcb() the single entry point for tcpcb destructor, merging the
tcp_freecb() to the end of the function.

While here, also remove lots of lingering checks in the beginning of
TCP timer functions. With a locked callout they are unnecessary.

While here, clean unused parts of timer KPI for the pluggable TCP stacks.

While here, remove TCPDEBUG from tcp_timer.c, as this allows for more
simplification of TCP timers. The TCPDEBUG is scheduled for removal.

Move the DTrace probes in timers to the beginning of a function, where
a tcpcb is always existing.

Discussed with: rrs, tuexen, rscheff (the TCP part of the diff)
Reviewed by: hselasky, kib, mav (the callout part)
Differential revision: https://reviews.freebsd.org/D37321


# 918fa422 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove tcp_timer_suspend()

It was a temporary code added together with RACK to fight against
TCP timer races.


# bd4f9866 16-Nov-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: remove unused t_rttbest

No functional change intended.

Reviewed by: rscheff@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D37401


# 9eb0e832 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: provide macros to access inpcb and socket from a tcpcb

There should be no functional changes with this commit.

Reviewed by: rscheff
Differential revision: https://reviews.freebsd.org/D37123


# bcf8fb7f 08-Nov-2022 Gordon Bergling <gbe@FreeBSD.org>

tcp_bbr(4): Fix a typo in a source code comment

- s/retranmitted/retransmitted/

MFC after: 3 days


# f504685a 31-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

rack/bbr: put back assertion that connection is not in TIME-WAIT

The assertion was incorrectly removed in 0d7445193ab. The leak of
a TIME-WAIT state into tfb_do_segment_nounlock method was fixed in
31bc602ff81. The TIME-WAIT connections are processed by the main
tcp_input() always.


# 83c1ec92 20-Oct-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: ECN preparations for ECN++, AccECN (tcp_respond)

tcp_respond is another function to build a tcp control packet
quickly. With ECN++ and AccECN, both the IP ECN header, and
the TCP ECN flags are supposed to reflect the correct state.

Also ensure that on receiving multiple ECN SYN-ACKs, the
responses triggered will reflect the latest state.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36973


# 53af6903 06-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove INP_TIMEWAIT flag

Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193ab, this commit shall not cause any functional changes.

Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.

Differential revision: https://reviews.freebsd.org/D36400


# 0d744519 06-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove tcptw, the compressed timewait state structure

The memory savings the tcptw brought back in 2003 (see 340c35de6a2) no
longer justify the complexity required to maintain it. For longer
explanation please check out the email [1].

Surpisingly through almost 20 years the TCP stack functionality of
handling the TIME_WAIT state with a normal tcpcb did not bitrot. The
existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state,
which is confirmed by the packetdrill tcp-testsuite [2].

This change just removes tcptw and leaves INP_TIMEWAIT. The flag will
be removed in a separate commit. This makes it easier to review and
possibly debug the changes.

[1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html
[2] https://github.com/freebsd-net/tcp-testsuite

Differential revision: https://reviews.freebsd.org/D36398


# 2220b66f 03-Oct-2022 Konstantin Belousov <kib@FreeBSD.org>

Add mbuf_tstmp2timeval()

Reviewed by: hselasky, jkim, rscheff
Sponsored by: NVIDIA networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D36870


# 493105c2 21-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: fix simultaneous open and refine e80062a2d43

- The soisconnected() call on transition from SYN_RCVD to ESTABLISHED
is also necessary for a half-synchronized connection. Fix that
just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED.
- Provide a comment that explains at what conditions the call to
soisconnected() is necessary.
- Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN.
- Extend the change to the BBR and RACK stacks.

Note: the interaction between the accept_filter(9) and the socket layer
is not fully consistent, yet. For most accept filters this call to
soisconnected() will not move the connection from the incomplete queue
to the complete. The move would happen only when the filter has received
the desired data, and soisconnected() would be called once again from
sorwakeup(). Ideally, we should mark socket as connected only there,
and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the
simultaneous open case. However, this doesn't yet work.

Reviewed by: rscheff, tuexen, rrs
Differential revision: https://reviews.freebsd.org/D36641


# 76248965 14-Aug-2022 Dimitry Andric <dim@FreeBSD.org>

Fix unused variable warning in tcp_stacks's bbr.c

With clang 15, the following -Werror warning is produced:

sys/netinet/tcp_stacks/bbr.c:11925:11: error: variable 'rtr_cnt' set but not used [-Werror,-Wunused-but-set-variable]
uint32_t rtr_cnt = 0;
^

The 'rtr_cnt' variable was in bbr.c when it was first added, but it
appears to have been a debugging aid that has never been used, so remove
it.

MFC after: 3 days


# aeb6948d 08-Jul-2022 Gleb Smirnoff <glebius@FreeBSD.org>

bbr: check proper flag for connection had been closed

An older version of D35663 slipped through final reviews.

Submitted by: Peter Lei
Fixes: 74703901d8bbc3bc7a29df648bc3c131c87393c2


# 74703901 04-Jul-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: use a TCP flag to check if connection has been close(2)d

The flag SS_NOFDREF is a private flag of the socket layer. It also
is supposed to be read with SOCK_LOCK(), which we don't own here.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D35663


# 4581cffb 12-May-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: fix build, convert missed sbreserve_locked() calls

Fixes: 4328318445ae


# 033718ab 12-Apr-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Whitespace cleanup in brr and rack

Whitespace cleanup (leading spaces to tabs)
Nicefy function definitions with indentations

No functional change

Reviewed By: #transport, thj
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D30043


# 2dd0c2bc 09-Apr-2022 Gordon Bergling <gbe@FreeBSD.org>

tcp_bbr(4): Fix a typo in a source code comment

- s/possiblity/possibility/

MFC after: 3 days


# 66570901 09-Apr-2022 Gordon Bergling <gbe@FreeBSD.org>

tcp_bbr(4): Fix two typos in source code comments

- s/postive/positive/
- s/postion/position/

MFC after: 3days


# 4d6883cb 08-Apr-2022 Gordon Bergling <gbe@FreeBSD.org>

tcp_bbr(4): Fix a typo in a sysctl description and a comment

- s/postive/positive/

MFC after: 5 days


# 75fdc440 07-Feb-2022 Gordon Bergling <gbe@FreeBSD.org>

extra_tcp_stacks: Fix two typos in source code comments

- s/recusive/recursive/

MFC after: 3 days


# 1ebf4607 03-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Access all 12 TCP header flags via inline function

In order to consistently provide access to all
(including reserved) TCP header flag bits,
use an accessor function tcp_get_flags and
tcp_set_flags. Also expand any flag variable from
uint8_t / char to uint16_t.

Reviewed By: hselasky, tuexen, glebius, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34130


# 3b3c08c1 02-Feb-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: cleanup functions related to socket option handling

Consistently only pass the inp and the sopt around. Don't pass the
so around, since in a upcoming commit tcp_ctloutput_set() will be
called from a context different from setsockopt(). Also expect
the inp to be locked when calling tcp_ctloutput_[gs]et(), this is
also required for the upcoming use by tcpsso, a command line tool
to set socket options.
Reviewed by: glebius, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D34151


# 9e58cca3 26-Jan-2022 Gordon Bergling <gbe@FreeBSD.org>

extra_tcp_stacks: Fix two typos in source code comments

- s/differnt/different/

MFC after; 3 days


# b3df222e 26-Jan-2022 Gordon Bergling <gbe@FreeBSD.org>

extra_tcp_stacks: Fix a few common typos

TCP_BBR:
- Fix a typo introducted in 1b90dfa5d2b0, which was reported by tuexen@

TCP_RACK:
- Correct two sysctl descriptions: s/corret/correct/

tcp_bbr(4): Also fix s/measurment/measurement/ in the man page

MFC after: 1 week


# aac52f94 18-Jan-2022 Randall Stewart <rrs@FreeBSD.org>

tcp: Warning cleanup from new compiler.

The clang compiler recently got an update that generates warnings of unused
variables where they were set, and then never used. This revision goes through
the tcp stack and cleans all of those up.

Reviewed by: Michael Tuexen, Gleb Smirnoff
Sponsored by: Netflix Inc.
Differential Revision:


# 1b90dfa5 02-Jan-2022 Gordon Bergling <gbe@FreeBSD.org>

tcp_bbr(4): Fix a few typos in sysctl descriptions

- s/measurment/measurement/

MFC after: 3 days


# a370832b 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove delayed drop KPI

No longer needed after tcp_output() can ask caller to drop.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33371


# f64dc2ab 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: TCP output method can request tcp_drop

The advanced TCP stacks (bbr, rack) may decide to drop a TCP connection
when they do output on it. The default stack never does this, thus
existing framework expects tcp_output() always to return locked and
valid tcpcb.

Provide KPI extension to satisfy demands of advanced stacks. If the
output method returns negative error code, it means that caller must
call tcp_drop().

In tcp_var() provide three inline methods to call tcp_output():
- tcp_output() is a drop-in replacement for the default stack, so that
default stack can continue using it internally without modifications.
For advanced stacks it would perform tcp_drop() and unlock and report
that with negative error code.
- tcp_output_unlock() handles the negative code and always converts
it to positive and always unlocks.
- tcp_output_nodrop() just calls the method and leaves the responsibility
to drop on the caller.

Sweep over the advanced stacks and use new KPI instead of using HPTS
delayed drop queue for that.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33370


# 40fa3e40 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: mechanically substitute call to tfb_tcp_output to new method.

Made with sed(1) execution:

sed -Ef sed -i "" $(grep --exclude tcp_var.h -lr tcp_output sys/)

sed:
s/tp->t_fb->tfb_tcp_output\(tp\)/tcp_output(tp)/
s/to tfb_tcp_output\(\)/to tcp_output()/

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33366


# db0ac6de 02-Dec-2021 Cy Schubert <cy@FreeBSD.org>

Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816"

This reverts commit 266f97b5e9a7958e365e78288616a459b40d924a, reversing
changes made to a10253cffea84c0c980a36ba6776b00ed96c3e3b.

A mismerge of a merge to catch up to main resulted in files being
committed which should not have been.


# f971e791 02-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: rename input queue to drop queue and trim dead code

The HPTS input queue is in reality used only for "delayed drops".
When a TCP stack decides to drop a connection on the output path
it can't do that due to locking protocol between main tcp_output()
and stacks. So, rack/bbr utilize HPTS to drop the connection in
a different context.

In the past the queue could also process input packets in context
of HPTS thread, but now no stack uses this, so remove this
functionality.

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33025


# 50f081ec 02-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: provide tcp_in_hpts().

It will hide some internal HPTS knowledge from the consumers.

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33023


# 1dadeab3 30-Nov-2021 Gordon Bergling <gbe@FreeBSD.org>

netinet: Fix a common typo in source code comments

- s/segement/segment/

MFC after: 3 days


# c28e39c3 03-Nov-2021 Gordon Bergling <gbe@FreeBSD.org>

Fix a common typo in syctl descriptions

- s/maxiumum/maximum/

MFC after: 3 days


# f581a26e 25-Oct-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP.

Pass control for IP/IP6 level options from generic tcp_ctloutput_set()
down to per-stack ctloutput.

Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput().

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D32655


# a36230f7 01-Oct-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Make dsack stats available in netstat and also make sure its aware of TLP's.

DSACK accounting has been for quite some time under a NETFLIX_STATS ifdef. Statistics
on DSACKs however are very useful in figuring out how much bad retransmissions you
are doing. This is further complicated, however, by stacks that do TLP. A TLP
when discovering a lost ack in the reverse path will cause the generation
of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well
as the more traditional dsack-bytes and dsack-packets. These will now
all display in netstat -p tcp -s. This also updates all stacks that
are currently built to keep track of these stats.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32158


# b1e806c0 07-Jul-2021 Andrew Gallatin <gallatin@FreeBSD.org>

tcp: fix alternate stack build with LINT-NO{INET,INET6,IP}

When fixing another bug, I noticed that the alternate
TCP stacks do not build when various combinations of
ipv4 and ipv6 are disabled.

Reviewed by: rrs, tuexen
Differential Revision: https://reviews.freebsd.org/D31094
Sponsored by: Netflix


# d7955cc0 06-Jul-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: HPTS performance enhancements

HPTS drives both rack and bbr, and yet there have been many complaints
about performance. This bit of work restructures hpts to help reduce CPU
overhead. It does this by now instead of relying on the timer/callout to
drive it instead use user return from a system call as well as lro flushes
to drive hpts. The timer becomes a backstop that dynamically adjusts
based on how "late" we are.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31083


# 9e4d9e4c 25-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Preparation for allowing hardware TLS to be able to kick a tcp connection that is retransmitting too much out of hardware and back to software.

Hardware TLS is now supported in some interface cards and it works well. Except that
when we have connections that retransmit a lot we get into trouble with all the retransmits.
This prep step makes way for change that Drew will be making so that we can "kick out" a
session from hardware TLS.

Reviewed by: mtuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30895


# ba1b3e48 11-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Missing mfree in rack and bbr

Recently (Nov) we added logic that protects against a peer negotiating a timestamp, and
then not including a timestamp. This involved in the input path doing a goto done_with_input
label. Now I suspect the code was cribbed from one in Rack that has to do with the SYN.
This had a bug, i.e. it should have a m_freem(m) before going to the label (bbr had this
missing m_freem() but rack did not). This then caused the missing m_freem to show
up in both BBR and Rack. Also looking at the code referencing m->m_pkthdr.lro_nsegs
later (after processing) is not a good idea, even though its only for logging. Best to
copy that off before any frees can take place.

Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30727


# 67e89281 10-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Mbuf leak while holding a socket buffer lock.

When running at NF the current Rack and BBR changes with the recent
commits from Richard that cause the socket buffer lock to be held over
the ip_output() call and then finally culminating in a call to tcp_handle_wakeup()
we get a lot of leaked mbufs. I don't think that this leak is actually caused
by holding the lock or what Richard has done, but is exposing some other
bug that has probably been lying dormant for a long time. I will continue to
look (using his changes) at what is going on to try to root cause out the issue.

In the meantime I can't leave the leaks out for everyone else. So this commit
will revert all of Richards changes and move both Rack and BBR back to just
doing the old sorwakeup_locked() calls after messing with the so_rcv buffer.

We may want to look at adding back in Richards changes after I have pinpointed
the root cause of the mbuf leak and fixed it.

Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30704


# 4747500d 04-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: A better fix for the previously attempted fix of the ack-war issue with tcp.

So it turns out that my fix before was not correct. It ended with us failing
some of the "improved" SYN tests, since we are not in the correct states.
With more digging I have figured out the root of the problem is that when
we receive a SYN|FIN the reassembly code made it so we create a segq entry
to hold the FIN. In the established state where we were not in order this
would be correct i.e. a 0 len with a FIN would need to be accepted. But
if you are in a front state we need to strip the FIN so we correctly handle
the ACK but ignore the FIN. This gets us into the proper states
and avoids the previous ack war.

I back out some of the previous changes but then add a new change
here in tcp_reass() that fixes the root cause of the issue. We still
leave the rack panic fixes in place however.

Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30627


# 8c69d988 27-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: When we have an out-of-order FIN we do want to strip off the FIN bit.

The last set of commits fixed both a panic (in rack) and an ACK-war (in freebsd and bbr).
However there was a missing case, i.e. where we get an out-of-order FIN by itself.
In such a case we don't want to leave the FIN bit set, otherwise we will do the
wrong thing and ack the FIN incorrectly. Instead we need to go through the
tcp_reasm() code and that way the FIN will be stripped and all will be well.

Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30497


# 13c0e198 25-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Fix bugs related to the PUSH bit and rack and an ack war

Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed
in the last commit. The problem is the left edge gets transmitted before the adjustments are done
to the send_map, this means that right edge bits must be considered to be added only if
the entire RSM is being retransmitted.

Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns
out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself.
After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically
what happens is we go into the reassembly code and lose the FIN bit. The trick here is we
should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything.
That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no
timers running. This is because the usrclosed function gets called and the FIN's and such have
already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get
stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly
recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE
before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2
timer can speed this up in testing.

Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30451


# 8923ce63 22-May-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: Handle stack switch while processing socket options

Handle the case where during socket option processing, the user
switches a stack such that processing the stack specific socket
option does not make sense anymore. Return an error in this case.

MFC after: 1 week
Reviewed by: markj
Reported by: syzbot+a6e1d91f240ad5d72cd1@syzkaller.appspotmail.com
Sponsored by: Netflix, Inc.
Differential revision: https://reviews.freebsd.org/D30395


# 032bf749 21-May-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

[tcp] Keep socket buffer locked until upcall

r367492 would unlock the socket buffer before eventually calling the upcall.
This leads to problematic interaction with NFS kernel server/client components
(MP threads) accessing the socket buffer with potentially not correctly updated
state.

Reported by: rmacklem
Reviewed By: tuexen, #transport
Tested by: rmacklem, otis
MFC after: 2 weeks
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29690


# 500eb6dd 21-May-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: Fix sending of TCP segments with IP level options

When bringing in TCP over UDP support in
https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605,
the length of IP level options was considered when locating the
transport header. This was incorrect and is fixed by this patch.

X-MFC with: https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605
MFC after: 3 days
Reviewed by: markj, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D30358


# 5d8fd932 06-May-2021 Randall Stewart <rrs@FreeBSD.org>

This brings into sync FreeBSD with the netflix versions of rack and bbr.
This fixes several breakages (panics) since the tcp_lro code was
committed that have been reported. Quite a few new features are
now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the
largest). There is also support for ack-war prevention. Documents
comming soon on rack..

Sponsored by: Netflix
Reviewed by: rscheff, mtuexen
Differential Revision: https://reviews.freebsd.org/D30036


# 9e644c23 18-Apr-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add support for TCP over UDP

Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.

Reviewed by: rrs
Sponsored by: Netflix, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29469


# 40f41ece 22-Mar-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve handling of SYN segments in SYN-SENT state

Ensure that the stack does not generate a DSACK block for user
data received on a SYN segment in SYN-SENT state.

Reviewed by: rscheff
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D29376
Sponsored by: Netflix, Inc.


# 5666643a 13-Mar-2021 Gordon Bergling <gbe@FreeBSD.org>

Fix some common typos in comments

- occured -> occurred
- normaly -> normally
- controling -> controlling
- fileds -> fields
- insterted -> inserted
- outputing -> outputting

MFC after: 1 week


# 1a714ff2 26-Jan-2021 Randall Stewart <rrs@FreeBSD.org>

This pulls over all the changes that are in the netflix
tree that fix the ratelimit code. There were several bugs
in tcp_ratelimit itself and we needed further work to support
the multiple tag format coming for the joint TLS and Ratelimit dances.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D28357


# d2b3cedd 13-Jan-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add sysctl to tolerate TCP segments missing timestamps

When timestamp support has been negotiated, TCP segements received
without a timestamp should be discarded. However, there are broken
TCP implementations (for example, stacks used by Omniswitch 63xx and
64xx models), which send TCP segments without timestamps although
they negotiated timestamp support.
This patch adds a sysctl variable which tolerates such TCP segments
and allows to interoperate with broken stacks.

Reviewed by: jtl@, rscheff@
Differential Revision: https://reviews.freebsd.org/D28142
Sponsored by: Netflix, Inc.
PR: 252449
MFC after: 1 week


# cc3c3485 13-Jan-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix handling of TCP RST segments missing timestamps

A TCP RST segment should be processed even it is missing TCP
timestamps.

Reported by: dmgk@, kevans@
Reviewed by: rscheff@, dmgk@
Sponsored by: Netflix, Inc.
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28143


# 283c76c7 09-Nov-2020 Michael Tuexen <tuexen@FreeBSD.org>

RFC 7323 specifies that:
* TCP segments without timestamps should be dropped when support for
the timestamp option has been negotiated.
* TCP segments with timestamps should be processed normally if support
for the timestamp option has not been negotiated.
This patch enforces the above.

PR: 250499
Reviewed by: gnn, rrs
MFC after: 1 week
Sponsored by: Netflix, Inc
Differential Revision: https://reviews.freebsd.org/D27148


# 4d0770f1 08-Nov-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Prevent premature SACK block transmission during loss recovery

Under specific conditions, a window update can be sent with
outdated SACK information. Some clients react to this by
subsequently delaying loss recovery, making TCP perform very
poorly.

Reported by: chengc_netapp.com
Reviewed by: rrs, jtl
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D24237


# 285385ba 09-Sep-2020 Randall Stewart <rrs@FreeBSD.org>

So it turns out that syzkaller hit another crash. It has to do with switching
stacks with a SENT_FIN outstanding. Both rack and bbr will only send a
FIN if all data is ack'd so this must be enforced. Also if the previous stack
sent the FIN we need to make sure in rack that when we manufacture the
"unknown" sends that we include the proper HAS_FIN bits.

Note for BBR we take a simpler approach and just refuse to switch.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D26269


# 67d224ef 04-Sep-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

bbr: remove unused static function

bbr_log_type_hrdwtso() is a file local static unused function.
Remove it to avoid warnings on kernel compiles.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D26331


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# b9978183 19-Aug-2020 Andrew Gallatin <gallatin@FreeBSD.org>

TCP: remove special treatment for hardware (ifnet) TLS

Remove most special treatment for ifnet TLS in the TCP stack, except
for code to avoid mixing handshakes and bulk data.

This code made heroic efforts to send down entire TLS records to
NICs. It was added to improve the PCIe bus efficiency of older TLS
offload NICs which did not keep state per-session, and so would need
to re-DMA the first part(s) of a TLS record if a TLS record was sent
in multiple TCP packets or TSOs. Newer TLS offload NICs do not need
this feature.

At Netflix, we've run extensive QoE tests which show that this feature
reduces client quality metrics, presumably because the effort to send
TLS records atomically causes the server to both wait too long to send
data (leading to buffers running dry), and to send too much data at
once (leading to packet loss).

Reviewed by: hselasky, jhb, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26103


# e54b7cd0 01-Jul-2020 Michael Tuexen <tuexen@FreeBSD.org>

Fix the cleanup handling in a error path for TCP BBR.

Reported by: syzbot+df7899c55c4cc52f5447@syzkaller.appspotmail.com
Reviewed by: rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D25486


# 95ef69c6 16-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

iSo in doing final checks on OCA firmware with all the latest tweaks the dup-ack checking
packet drill script was failing with a number of unexpected acks. So it turns
out if you have the default recvwin set up to 1Meg (like OCA's do) and you
have no window scaling (like the dupack checking code) then we have another
case where we are always trying to update the rwnd and sending an
ack when we should not.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D25298


# f092a3c7 12-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

So it turns out with the right window scaling you can get the code in all stacks to
always want to do a window update, even when no data can be sent. Now in
cases where you are not pacing thats probably ok, you just send an extra
window update or two. However with bbr (and rack if its paced) every time
the pacer goes off its going to send a "window update".

Also in testing bbr I have found that if we are not responding to
data right away we end up staying in startup but incorrectly holding
a pacing gain of 192 (a loss). This is because the idle window code
does not restict itself to only work with PROBE_BW. In all other
states you dont want it doing a PROBE_BW state change.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D25247


# e854dd38 08-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

An important statistic in determining if a server process (or client) is being delayed
is to know the time to first byte in and time to first byte out. Currently we
have no way to know these all we have is t_starttime. That (t_starttime) tells us
what time the 3 way handshake completed. We don't know when the first
request came in or how quickly we responded. Nor from a client perspective
do we know how long from when we sent out the first byte before the
server responded.

This small change adds the ability to track the TTFB's. This will show up in
BB logging which then can be pulled for later analysis. Note that currently
the tracking is via the ticks variable of all three variables. This provides
a very rough estimate (hz=1000 its 1ms). A follow-on set of work will be
to change all three of these values into something with a much finer resolution
(either microseconds or nanoseconds), though we may want to make the resolution
configurable so that on lower powered machines we could still use the much
cheaper ticks variable.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D24902


# f1ea4e41 03-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

This fixes a couple of skyzaller crashes. Most
of them have to do with TFO. Even the default stack
had one of the issues:

1) We need to make sure for rack that we don't advance
snd_nxt beyond iss when we are not doing fast open. We
otherwise can get a bunch of SYN's sent out incorrectly
with the seq number advancing.
2) When we complete the 3-way handshake we should not ever
append to reassembly if the tlen is 0, if TFO is enabled
prior to this fix we could still call the reasemmbly. Note
this effects all three stacks.
3) Rack like its cousin BBR should track if a SYN is on a
send map entry.
4) Both bbr and rack need to only consider len incremented on a SYN
if the starting seq is iss, otherwise we don't increment len which
may mean we return without adding a sendmap entry.

This work was done in collaberation with Michael Tuexen, thanks for
all the testing!
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D25000


# 77c68315 23-May-2020 Emmanuel Vadot <manu@FreeBSD.org>

bbr: Use arc4random_uniform from libkern.

This unbreak LINT build

Reported by: jenkins, melifaro


# 8e051165 21-May-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Retain only mutually supported TCP options after simultaneous SYN

When receiving a parallel SYN in SYN-SENT state, remove all the
options only we supported locally before sending the SYN,ACK.

This addresses a consistency issue on parallel opens.

Also, on such a parallel open, the stack could be coaxed into
running with timestamps enabled, even if administratively disabled.

Reviewed by: tuexen (mentor)
Approved by: tuexen (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D23371


# 777b88d6 15-May-2020 Randall Stewart <rrs@FreeBSD.org>

This fixes several skyzaller issues found with the
help of Michael Tuexen. There was some accounting
errors with TCPFO for bbr and also for both rack
and bbr there was a FO case where we should be
jumping to the just_return_nolock label to
exit instead of returning 0. This of course
caused no timer to be running and thus the
stuck sessions.

Reported by: Michael Tuexen and Skyzaller
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D24852


# b1ddcbc6 07-May-2020 Randall Stewart <rrs@FreeBSD.org>

When in the SYN-SENT state bbr and rack will not properly send an ACK but instead start the D-ACK timer. This
causes so_reuseport_lb_test to fail since it slows down how quickly the program runs until the timeout occurs
and fails the test

Sponsored by: Netflix inc.
Differential Revision: https://reviews.freebsd.org/D24747


# 8717b8f1 07-May-2020 Randall Stewart <rrs@FreeBSD.org>

NF has an internal option that changes the tcp_mcopy_m routine slightly (has
a few extra arguments). Recently that changed to only have one arg extra so
that two ifdefs around the call are no longer needed. Lets take out the
extra ifdef and arg.

Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D24736


# 570045a0 04-May-2020 Randall Stewart <rrs@FreeBSD.org>

This fixes two issues found by ankitraheja09@gmail.com
1) When BBR retransmits the syn it was messing up the snd_max
2) When we need to send a RST we might not send it when we should

Reported by: ankitraheja09@gmail.com
Sponsored by: Netflix.com
Differential Revision: https://reviews.freebsd.org/D24693


# 7985fd7e 04-May-2020 Michael Tuexen <tuexen@FreeBSD.org>

Enter the net epoch before calling the output routine in TCP BBR.
This was only triggered when setting the IPPROTO_TCP level socket
option TCP_DELACK.
This issue was found by runnning an instance of SYZKALLER.
Reviewed by: rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24690


# 963fb2ad 04-May-2020 Randall Stewart <rrs@FreeBSD.org>

This commit brings things into sync with the advancements that
have been made in rack and adds a few fixes in BBR. This also
removes any possibility of incorrectly doing OOB data the stacks
do not support it. Should fix the skyzaller crashes seen in the
past. Still to fix is the BBR issue just reported this weekend
with the SYN and on sending a RST. Note that this version of
rack can now do pacing as well.

Sponsored by:Netflix Inc
Differential Revision:https://reviews.freebsd.org/D24576


# 9028b6e0 29-Apr-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Prevent premature shrinking of the scaled receive window
which can cause a TCP client to use invalid or stale TCP sequence numbers for ACK packets.

Packets with old sequence numbers are ignored and not used to update the send window size.
This might cause the TCP session to hang indefinitely under some circumstances.

Reported by: Cui Cheng
Reviewed by: tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D24515


# b2ade6b1 29-Apr-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Correctly set up the initial TCP congestion window in all cases,
by not including the SYN bit sequence space in cwnd related calculations.
Snd_und is adjusted explicitly in all cases, outside the cwnd update, instead.

This fixes an off-by-one conformance issue with regular TCP sessions not
using Appropriate Byte Counting (RFC3465), sending one more packet during
the initial window than expected.

PR: 235256
Reviewed by: tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D19000


# ac99fd86 25-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix LINT build broken by r360292.


# 983066f0 25-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert route caching to nexthop caching.

This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
to perform packet forwarding such as gateway interface and mtu. Nexthops
are shared among the routes, providing more pre-computed cache-efficient
data while requiring less memory. Splitting the LPM code and the attached
data solves multiple long-standing problems in the routing layer,
drastically reduces the coupling with outher parts of the stack and allows
to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
for notably better performance for large TCP senders. Caching works by
acquiring rtentry reference, which is protected by per-rtentry mutex.
If the routing table is changed (checked by comparing the rtable generation id)
or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
cached object, which is mostly mechanical. Other moving parts like cache
cleanup on rtable change remains the same.

Differential Revision: https://reviews.freebsd.org/D24340


# bb410f9f 21-Apr-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

revert rS360143 - Correctly set up initial cwnd
due to syzkaller panics found

Reported by: tuexen
Approved by: tuexen (mentor)
Sponsored by: NetApp, Inc.


# 73b76966 21-Apr-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Correctly set up the initial TCP congestion window
in all cases, by adjust snd_una right after the
connection initialization, to include the one byte
in sequence space occupied by the SYN bit.

This does not change the regular ACK processing,
while making the BYTES_THIS_ACK macro to work properly.

PR: 235256
Reviewed by: tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D19000


# 413c3db1 31-Mar-2020 Michael Tuexen <tuexen@FreeBSD.org>

Allow the TCP backhole detection to be disabled at all, enabled only
for IPv4, enabled only for IPv6, and enabled for IPv4 and IPv6.
The current blackhole detection might classify a temporary outage as
an MTU issue and reduces permanently the MSS. Since the consequences of
such a reduction due to a misclassification are much more drastically
for IPv4 than for IPv6, allow the administrator to enable it for IPv6 only.

Reviewed by: bcr@ (man page), Richard Scheffenegger
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24219


# 7ca6e296 12-Mar-2020 Michael Tuexen <tuexen@FreeBSD.org>

Use KMOD_TCPSTAT_INC instead of TCPSTAT_INC for RACK and BBR, since
these are kernel modules. Also add a KMOD_TCPSTAT_ADD and use that
instead of TCPSTAT_ADD.

Reviewed by: jtl@, rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D23904


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 3fba40d9 11-Feb-2020 Randall Stewart <rrs@FreeBSD.org>

Remove all trailing white space from the BBR/Rack fold. Bits
left around by emacs (thanks emacs).


# 5aa0576b 07-Feb-2020 Ed Maste <emaste@FreeBSD.org>

Miscellaneous typo fixes

Submitted by: Gordon Bergling <gbergling_gmail.com>
Differential Revision: https://reviews.freebsd.org/D23453


# 109eb549 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make tcp_output() require network epoch.

Enter the epoch before calling into tcp_output() from those
functions, that didn't do that before.

This eliminates a bunch of epoch recursions in TCP.


# b9555453 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make ip6_output() and ip_output() require network epoch.

All callers that before may called into these functions
without network epoch now must enter it.


# 334fc582 08-Jan-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

vnet: virtualise more network stack sysctls.

Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are
set in the netoptions startup script, which we would love to run for VNETs
as well [1].

While virtualising the log_in_vain sysctls seems pointles at first for as
long as the kernel message buffer is not virtualised, it at least allows
an administrator to debug the base system or an individual jail if needed
without turning the logging on for all jails running on a system.

PR: 243193 [1]
MFC after: 2 weeks


# 1cf55767 17-Dec-2019 Randall Stewart <rrs@FreeBSD.org>

This commit is a bit of a re-arrange of deck chairs. It
gets both rack and bbr ready for the completion of the STATs
framework in FreeBSD. For now if you don't have both NF_stats and
stats on it disables them. As soon as the rest of the stats framework
lands we can remove that restriction and then just uses stats when
defined.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D22479


# d40c0d47 07-Nov-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Now that all of the tcp_input() and all its branches are executed
in the network epoch, we can greatly simplify synchronization.
Remove all unneccesary epoch enters hidden under INP_INFO_RLOCK macro.
Remove some unneccesary assertions and convert necessary ones into the
NET_EPOCH_ASSERT macro.


# 9992c365 23-Oct-2019 Randall Stewart <rrs@FreeBSD.org>

Fix a small bug in bbr when running under a VM. Basically what
happens is we are more delayed in the pacer calling in so
we remove the stack from the pacer and recalculate how
much time is left after all data has been acknowledged. However
the comparision was backwards so we end up with a negative
value in the last_pacing_delay time which causes us to
add in a huge value to the next pacing time thus stalling
the connection.

Reported by: vm2.finance@gmail.com


# 12a43d0d 29-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

RFC 7112 requires a host to put the complete IP header chain
including the TCP header in the first IP packet.
Enforce this in tcp_output(). In addition make sure that at least
one byte payload fits in the TCP segement to allow making progress.
Without this check, a kernel with INVARIANTS will panic.
This issue was found by running an instance of syzkaller.

Reviewed by: jtl@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21665


# ac7bd23a 24-Sep-2019 Randall Stewart <rrs@FreeBSD.org>

lets put (void) in a couple of functions to keep older platforms that
are stuck with gcc happy (ppc). The changes are needed in both bbr and
rack.

Obtained from: Michael Tuexen (mtuexen@)


# da99b33b 24-Sep-2019 Randall Stewart <rrs@FreeBSD.org>

don't call in_ratelmit detach when RATELIMIT is not
compiled in the kernel.


# 35c7bb34 24-Sep-2019 Randall Stewart <rrs@FreeBSD.org>

This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This
is a completely separate TCP stack (tcp_bbr.ko) that will be built only if
you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option
TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that
supports hardware pacing, BBR understands how to use such a feature.

Note that this commit also adds in a general purpose time-filter which
allows you to have a min-filter or max-filter. A filter allows you to
have a low (or high) value for some period of time and degrade slowly
to another value has time passes. You can find out the details of
BBR by looking at the original paper at:

https://queue.acm.org/detail.cfm?id=3022184

or consult many other web resources you can find on the web
referenced by "BBR congestion control". It should be noted that
BBRv1 (which this is) does tend to unfairness in cases of small
buffered paths, and it will usually get less bandwidth in the case
of large BDP paths(when competing with new-reno or cubic flows). BBR
is still an active research area and we do plan on implementing V2
of BBR to see if it is an improvement over V1.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D21582