History log of /freebsd-current/sys/netinet/tcp_var.h
Revision Date Author Comments
# ea916b64 18-May-2024 Randall Stewart <rrs@FreeBSD.org>

Remove TCP_SAD optional code now that the sack filter performs this function.

With the commit of D44903 we no longer need the SAD option. Instead all stacks that
use the sack filter inherit its protection against sack-attack.

Reviewed by: tuexen@
Differential Revision:https://reviews.freebsd.org/D45216


# 2a9aae9e 08-May-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: add counter to track when SACK loss recovery uses TSO

Add a counter to track how frequently SACK has transmitted
more than one MSS using TSO. Instances when this will be
beneficial is the use of PRR, or when ACK thinning due to
GRO/LRO or ACK discards by the network are present.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D45070


# dcdfe449 08-May-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: add sysctl to allow/disallow TSO during SACK loss recovery

Introduce net.inet.tcp.sack.tso for future use when TSO is ready
to be used during loss recovery.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D45068


# fce03f85 05-May-2024 Randall Stewart <rrs@FreeBSD.org>

TCP can be subject to Sack Attacks lets fix this issue.

There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like:

ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704)
This first sack looks fine but then the attacker sends

ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705)
ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707)
...
These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries.

To combat this we introduce three things.

when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing.
Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks.
The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway.
The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks).

After this set of changes is in rack can drop its SAD detection completely

Reviewed by:tuexen@, rscheff@
Differential Revision: <https://reviews.freebsd.org/D44903>


# 1d14e88e 08-Apr-2024 Mark Johnston <markj@FreeBSD.org>

tcp: Make tcp_var.h more self-contained

struct tcpcb embeds a struct osd and a struct callout. Rather than
forcing all consumers to pull in the same headers, include the headers
directly.

No functional change intended.

Reviewed by: glebius
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D44685


# 60d8dbbe 18-Jan-2024 Kristof Provost <kp@FreeBSD.org>

netinet: add a probe point for IP, IP6, ICMP, ICMP6, UDP and TCP stats counters

When debugging network issues one common clue is an unexpectedly
incrementing error counter. This is helpful, in that it gives us an
idea of what might be going wrong, but often these counters may be
incremented in different functions.

Add a static probe point for them so that we can use dtrace to get
futher information (e.g. a stack trace).

For example:
dtrace -n 'mib:ip:count: { printf("%d", arg0); stack(); }'

This can be disabled by setting the following kernel option:
options KDTRACE_NO_MIB_SDT

Reviewed by: gallatin, tuexen (previous version), gnn (previous version)
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D43504


# 5a268d86 03-Apr-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix comment

Make the comment consistent with the code.

Reviewed by: rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44611


# dd7b86e2 18-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove IS_FASTOPEN() macro

The macro is more obfuscating than helping as it just checks a single flag
of t_flags. All other t_flags bits are checked without a macro.

A bigger problem was that declaration of the macro in tcp_var.h depended
on a kernel option. It is a bad practice to create such definitions in
installable headers.

Reviewed by: rscheff, tuexen, kib
Differential Revision: https://reviews.freebsd.org/D44362


# 220ee18f 13-Mar-2024 Konstantin Belousov <kib@FreeBSD.org>

netinet/tcp_var.h: always define IS_FASTOPEN() for kernel compilation env

and drop the definition for userspace (which matched TCP_RFC7413) since
it depends on presence of the kernel option.

Reviewed by: glebius, rscheff
Sponsored by: NVIDIA networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D44349


# e4315bbc 13-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: move struct tcp_ifcap declaration under _KERNEL

Reviewed by: rscheff, tuexen, kib
Differential Revision: https://reviews.freebsd.org/D44340


# e18b97bd 12-Mar-2024 Randall Stewart <rrs@FreeBSD.org>

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

With a bit of a struggle with git I finally got rack_pcm.c into place (apologies
for not noticing this error). The LINT kernel is running on my box now .. sigh.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986


# c112243f 11-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

Revert "Update to bring the rack stack with all its fixes in."

This commit was incomplete and breaks LINT kernels. The tree has been
broken for 8+ hours.

This reverts commit f6d489f402c320f1a6eaa473491a0b8c3878113e.


# f6d489f4 11-Mar-2024 Randall Stewart <rrs@FreeBSD.org>

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

Note there is a new file that I can't figure out how to get in rack_pcm.c

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986


# c7c325d0 24-Jan-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: pass maxseg around instead of calculating locally

Improve slowpath processing (reordering, retransmissions)
slightly by calculating maxseg only once. This typically
saves one of two calls to tcp_maxseg().

Reviewed By: glebius, tuexen, cc, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43536


# 7f3184ba 22-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove outdated comment

This paragraph should have been removed in 446ccdd08e2a.


# 30409ecd 06-Jan-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: do not purge SACK scoreboard on first RTO

Keeping the SACK scoreboard intact after the first RTO
and retransmitting all data anew only on subsequent RTOs
allows a more timely and efficient loss recovery under
many adverse cirumstances.

Reviewed By: tuexen, #transport
MFC after: 10 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42906


# a8b70cf2 24-Dec-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

netpfil: Use accessor functions and named constants for all tcphdr flags

Update all remaining references to the struct tcphdr th_x2 field.
This completes the compatibilty of various aspects with AccECN
(TH_AE), after the internal ipfw "re-checksum required" was moved
to use the TH_RES1 flag.

No functional change.

Reviewed By: tuexen, #transport, glebius
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43172


# 8717c306 21-Dec-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: allow userspace use of tcp header flags accessor functions

Provide accessor functions to all 12 possible TCP header
flags for userspace too.

Reviewed By: zlei
MFC after: 2 weeks
Sponsored by: Netapp, Inc.
Differential Revision: https://reviews.freebsd.org/D43152


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 219a6ca9 21-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: uninline tcp_account_for_send()

This allows to clear inclusion of "opt_kern_tls.h" from a system header.

Reviewed by: rscheff, tuexen
Differential Revision: https://reviews.freebsd.org/D42696


# 49a6fbe3 15-Nov-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

[tcp] add PRR 6937bis heuristic and retire prr_conservative sysctl

Improve Proportional Rate Reduction (RFC6937) by using a
heuristic, which automatically chooses between
conservative CRB and more aggressive SSRB modes.
Only when snd_una advances (a partial ACK), SSRB may be
used. Also, that ACK must not have any indication of
ongoing loss - using the addition of new holes into the
scoreboard as proxy for such an event.

MFC after: 4 weeks
Reviewed By: #transport, kbowling, rrs
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28822


# 22dc8609 17-Oct-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: use signed IsLost() related accounting variables

Coverity found that one safety check (kassert) was not
functional, as possible incorrect subtractions during
the accounting wouldn't show up as (invalid) negative
values.

Reported by: gallatin
Reviewed By: cc, #transport
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42180


# e2c6a6d2 09-Oct-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: include RFC6675 IsLost() in pipe calculation

Add more accounting while processing SACK data, to
keep track of when a packet is deemed lost using
the RFC6675 guidance.

Together with PRR (RFC6972) this allows a sender to
retransmit presumed lost packets faster, and loss
recovery to complete earlier.

Reviewed By: cc, rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D39299


# 2ff63af9 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .h pattern

Remove /^\s*\*+\s*\$FreeBSD\$.*$\n/


# cf32543f 27-Jul-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: document that conditional fields in tcpcb should be at the end

Reviewed by: rscheff, Peter Lei
Sponsored by: Netflix, Inc.


# e4a873bf 19-Jul-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve layout of struct tcpcb

Put optional fields at the end to minimize run time problems in
case CC modules are build from within its directory.

Reviewed by: cc, gallatin, glebius, imp
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D41059


# d3152ab2 17-Jul-2023 Warner Losh <imp@FreeBSD.org>

tcbpcb: Always define t_osd

Always define t_osd. congestion control modules access it
unconditionally. This fixes the build.

However, this is, at best, a temporary band-aide until the
larger issues are sorted.

Sponsored by: Netflix


# 43b117f8 06-Jun-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: make the maximum number of retransmissions tunable per VNET

Both Windows (TcpMaxDataRetransmissions) and Linux (tcp_retries2)
allow to restrict the maximum number of consecutive timer based
retransmissions. Add that same capability on a per-VNet basis to
FreeBSD.

Reviewed By: cc, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D40424


# 57a3a161 24-May-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: request tracking is not http specific.

This change is a name change only. TCP Request tracking can track sendfile and even non-sendfile requests. The
names however in the current code use http, and they should not. The feature is not http specific. Lets change the
name so they more properly reflect whats going on. This also fixes conflicts with http_req which caused application pain.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D40229


# ec6d620b 19-May-2023 Randall Stewart <rrs@FreeBSD.org>

There are congestion control algorithms will that pull in srtt, and this can cause issues with rack.

When using rack, cubic and htcp will grab the srtt, but they think it is in ticks. For rack
it is in micro-seconds (which we should probably move all stacks to actually). This causes
issues so instead lets make a new interface so that any CC module can pull the srtt in
whatever granularity they want.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D40146


# c3c20de3 25-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: move HPTS/LRO flags out of inpcb to tcpcb

These flags are TCP specific. While here, make also several LRO
internal functions to pass tcpcb pointer instead of inpcb one.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39698


# c2a69e84 25-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: move HPTS related fields from inpcb to tcpcb

This makes inpcb lighter and allows future cache line optimizations
of tcpcb. The reason why HPTS originally used inpcb is the compressed
TIME-WAIT state (see 0d7445193ab), that used to free a tcpcb, while the
associated connection is still on the HPTS ring.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39697


# 8aa2be69 20-Apr-2023 Cheng Cui <cc@FreeBSD.org>

Correct the value of macro TF2_TCP_ACCOUNTING.

Summary: Make sure the values are in order.

Reviewers: rscheff, tuexen, #transport!
Approved by: rscheff, tuexen, glebius
Subscribers: imp, melifaro, glebius
Differential Revision: https://reviews.freebsd.org/D39716


# a540cdca 17-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: use queue(9) STAILQ for the input queue

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39574


# 66fbc19f 07-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: pass tcpcb in the tfb_tcp_ctloutput() method instead of inpcb

Just matches rest of the KPI.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39435


# 35bc0bcc 07-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: reduce argument list to functions that pass a segment

The socket argument is superfluous, as a tcpcb always has one and
only one socket.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39434


# de4368dd 07-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: retire tfb_tcp_hpts_do_segment()

Isn't in use anymore. Correct comments that mention it.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39433


# 945f9a7c 07-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: misc cleanup of options for rack as well as socket option logging.

Both BBR and Rack have the ability to log socket options, which is currently disabled. Rack
has an experimental SaD (Sack Attack Detection) algorithm that should be made available. Also
there is a t_maxpeak_rate that needs to be removed (its un-used).

Reviewed by: tuexen, cc
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39427


# 73ee5756 31-Mar-2023 Randall Stewart <rrs@FreeBSD.org>

Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features.

So stack switching as always been a bit of a issue. We currently use a break before make setup which means that
if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so
that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from
it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).

Once all this lands I will then update rack to begin using all these new features.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210


# 69c7c811 16-Mar-2023 Randall Stewart <rrs@FreeBSD.org>

Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities.

The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints
we need to move to using the new inline functions. This adds them and moves rack to now use
the tcp_tracepoints.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D38831


# 2f201df1 20-Jul-2021 Alfonso <gfunni234@gmail.com>

Change hw_tls to a bool

Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/512


# 624de4ec 22-Feb-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: remove unused function prototype

tcp_trace was implemented in tcp_debug.c, which was removed recently.

Reviewed by: rscheff@, zlei@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D38712


# 76578d60 21-Feb-2023 Michael Tuexen <tuexen@FreeBSD.org>

bblog: improve timeout event handling

Extend the BBLog RTO event to deal with all timers of the base
stack. Also provide information about starting, stopping, and
running off. The expiration of the retransmission timer is
reported as it was done before.

Reviewed by: rscheff@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D38710


# 6b802933 21-Feb-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: rearrange enum and remove unused variable

Rearrange the enum tt_which such that TT_REXMIT is 0. This allows
an extension of the BBLog event RTO in a backwards compatible way.
Remove tcptimers, which was only used in trpt, a utility removed
from the source tree recently.

Reviewed by: glebius@, guest-ccui@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D38547


# c0e4090e 08-Feb-2023 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: Accurately track if ifnet ktls is enabled

This allows us to avoid spurious calls to ktls_disable_ifnet()

When we implemented ifnet kTLSe, we set a flag in the tx socket
buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that
now, or in the past, ifnet ktls was active on a socket. Later,
I added code to switch ifnet ktls sessions to software in the case
of lossy TCP connections that have a high retransmit rate.
Because TCP was using SB_TLS_IFNET to know if it needed to do math
to calculate the retransmit ratio and potentially call into
ktls_disable_ifnet(), it was doing unneeded work long after
a session was moved to software.

This patch carefully tracks whether or not ifnet ktls is still enabled
on a TCP connection. Because the inp is now embedded in the tcpcb, and
because TCP is the most frequent accessor of this state, it made sense to
move this from the socket buffer flags to the tcpcb. Because we now need
reliable access to the tcbcb, we take a ref on the inp when creating a tx
ktls session.

While here, I noticed that rack/bbr were incorrectly implementing
tfb_hwtls_change(), and applying the change to all pending sends,
when it should apply only to future sends.

This change reduces spurious calls to ktls_disable_ifnet() by 95% or so
in a Netflix CDN environment.

Reviewed by: markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D38380


# 9aff05bb 07-Feb-2023 John Baldwin <jhb@FreeBSD.org>

tcp_var.h: Fix spelling of independent in comment


# 18b83b62 26-Jan-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: reduce the size of t_rttupdated in tcpcb

During tcp session start, various mechanisms need to
track a few initial RTTs before becoming active.
Prevent overflows of the corresponding tracking counter
and reduce the size of tcpcb simultaneously.

Reviewed By: #transport, tuexen, guest-ccui
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D21117


# 446ccdd0 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: use single locked callout per tcpcb for the TCP timers

Use only one callout structure per tcpcb that is responsible for handling
all five TCP timeouts. Use locked version of callout, of course. The
callout function tcp_timer_enter() chooses soonest timer and executes it
with lock held. Unless the timer reports that the tcpcb has been freed,
the callout is rescheduled for next soonest timer, if there is any.

With single callout per tcpcb on connection teardown we should be able
to fully stop the callout and immediately free it, avoiding use of
callout_async_drain(). There is one gotcha here: callout_stop() can
actually touch our memory when a rare race condition happens. See
comment above tcp_timer_stop(). Synchronous stop of the callout makes
tcp_discardcb() the single entry point for tcpcb destructor, merging the
tcp_freecb() to the end of the function.

While here, also remove lots of lingering checks in the beginning of
TCP timer functions. With a locked callout they are unnecessary.

While here, clean unused parts of timer KPI for the pluggable TCP stacks.

While here, remove TCPDEBUG from tcp_timer.c, as this allows for more
simplification of TCP timers. The TCPDEBUG is scheduled for removal.

Move the DTrace probes in timers to the beginning of a function, where
a tcpcb is always existing.

Discussed with: rrs, tuexen, rscheff (the TCP part of the diff)
Reviewed by: hselasky, kib, mav (the callout part)
Differential revision: https://reviews.freebsd.org/D37321


# 918fa422 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove tcp_timer_suspend()

It was a temporary code added together with RACK to fight against
TCP timer races.


# e68b3792 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: embed inpcb into tcpcb

For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.

The most import one is the inpcb itself. With embedding we can provide
strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa. Also we reduce number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such move.

The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and temprorary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.

One interesting side effect is that now TCP data is allocated from
SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.

Large part of the change was done with sed script:

s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g

Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, that added several <netinet/in_pcb.h>.

Differential revision: https://reviews.freebsd.org/D37127


# bd4f9866 16-Nov-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: remove unused t_rttbest

No functional change intended.

Reviewed by: rscheff@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D37401


# 1a70101a 10-Nov-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: account sent/received IP ECN markings independently

Have tcpstats (netstat -s) differentiate between received and sent
ECN-marked packets. Also account for IP ECN bits (on TCP packets)
even when the tcp session has not negotiated ECN support.

Event: IETF 115 Hackathon
Reviewed By: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37314


# 8840ae22 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: don't store VNET in every tcpcb, take it from the inpcbinfo

Reviewed by: rscheff
Differential revision: https://reviews.freebsd.org/D37125


# 9eb0e832 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: provide macros to access inpcb and socket from a tcpcb

There should be no functional changes with this commit.

Reviewed by: rscheff
Differential revision: https://reviews.freebsd.org/D37123


# dc9daa04 08-Nov-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: allow packets to be marked as ECT1 instead of ECT0

This adds the capability for a modular congestion control
to select which variant of ECN-capable-transport it wants to use
when sending out elegible segments. As an initial CC to utilize
this, DCTCP was selected.

Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D24869


# 37bf391d 06-Nov-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: make tcp_packets_this_ack() only visible in kernel scope


# b1258b76 06-Nov-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: add conservative d.cep accounting algorithm

Accurate ECN asks to conservatively estimate, when the
ACE counter may have wrapped due to a single ACK covering a larger
number of segments. This is described in Annex A.2 of the
accurate-ecn draft.

Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37281


# c348e880 31-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: make tcp_handle_wakeup() static and robust

It is called only from tcp_input() and always has valid parameter.

Reviewed by: rscheff, tuexen
Differential revision: https://reviews.freebsd.org/D37115


# 83c1ec92 20-Oct-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: ECN preparations for ECN++, AccECN (tcp_respond)

tcp_respond is another function to build a tcp control packet
quickly. With ECN++ and AccECN, both the IP ECN header, and
the TCP ECN flags are supposed to reflect the correct state.

Also ensure that on receiving multiple ECN SYN-ACKs, the
responses triggered will reflect the latest state.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36973


# c3738466 18-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: style the struct tcpcb definition

- Use C99 types uintXX_t instead of u_intXX_t.
- Try to make space/tab usage a little bit more consistent.
- Shorten comments to fit into 80 chars.

Not a functional change, just making future changes easier to read.


# 0d744519 06-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove tcptw, the compressed timewait state structure

The memory savings the tcptw brought back in 2003 (see 340c35de6a2) no
longer justify the complexity required to maintain it. For longer
explanation please check out the email [1].

Surpisingly through almost 20 years the TCP stack functionality of
handling the TIME_WAIT state with a normal tcpcb did not bitrot. The
existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state,
which is confirmed by the packetdrill tcp-testsuite [2].

This change just removes tcptw and leaves INP_TIMEWAIT. The flag will
be removed in a separate commit. This makes it easier to review and
possibly debug the changes.

[1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html
[2] https://github.com/freebsd-net/tcp-testsuite

Differential revision: https://reviews.freebsd.org/D36398


# cd84e78f 04-Oct-2022 Randall Stewart <rrs@FreeBSD.org>

tcp idle reduce does not work for a server.

TCP has an idle-reduce feature that allows a connection to reduce its
cwnd after it has been idle more than an RTT. This feature only works
for a sending side connection. It does this by at output checking the
idle time (t_rcvtime vs ticks) to see if its more than the RTO timeout.

The problem comes if you are a web server. You get a request and
then send out all the data.. then go idle. The next time you would
send is in response to a request from the peer asking for more data.
But the thing is you updated t_rcvtime when the request came in so
you never reduce.

The fix is to do the idle reduce check also on inbound.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36721


# 775e20c1 03-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: make tcp_drop_syn_sent() static


# 43d39ca7 03-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netinet*: de-void control input IP protocol methods

After decoupling of protosw(9) and IP wire protocols in 78b1fc05b205 for
IPv4 we got vector ip_ctlprotox[] that is executed only and only from
icmp_input() and respectively for IPv6 we got ip6_ctlprotox[] executed
only and only from icmp6_input(). This allows to use protocol specific
argument types in these methods instead of struct sockaddr and void.

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36727


# 08af8aac 27-Sep-2022 Randall Stewart <rrs@FreeBSD.org>

Tcp progress timeout

Rack has had the ability to timeout connections that just sit idle automatically. This
feature of course is off by default and requires the user set it on (though the socket option
has been missing in tcp_usrreq.c). Lets get the progress timeout fully supported in
the base stack as well as rack.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36716


# e5049a17 26-Sep-2022 Randall Stewart <rrs@FreeBSD.org>

TCP rack does not work properly with cubic.

Right now if you use rack with cubic (the new default cc) you will have
improper results. This is because rack uses different variables than
the base stack (or bbr) and thus tcp_compute_pipe() always returns
so that cubic will choose a 30% backoff not the 50% backoff it should
when it is newreno compatibility mode. The fix is to allow a stack (rack)
to override its own compute_pipe.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36711


# 493105c2 21-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: fix simultaneous open and refine e80062a2d43

- The soisconnected() call on transition from SYN_RCVD to ESTABLISHED
is also necessary for a half-synchronized connection. Fix that
just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED.
- Provide a comment that explains at what conditions the call to
soisconnected() is necessary.
- Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN.
- Extend the change to the BBR and RACK stacks.

Note: the interaction between the accept_filter(9) and the socket layer
is not fully consistent, yet. For most accept filters this call to
soisconnected() will not move the connection from the incomplete queue
to the complete. The move would happen only when the filter has received
the desired data, and soisconnected() would be called once again from
sorwakeup(). Ideally, we should mark socket as connected only there,
and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the
simultaneous open case. However, this doesn't yet work.

Reviewed by: rscheff, tuexen, rrs
Differential revision: https://reviews.freebsd.org/D36641


# 0c7f3ae8 19-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcpcb: fix tabulation count in i4012ef7754c and abbreviate "packets"

This lines up comments to the rest of the file. Abbreviation
helps to fit in to 80 char terminal. Not a functional change.


# e80062a2 08-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: avoid call to soisconnected() on transition to ESTABLISHED

This call existed since pre-FreeBSD times, and it is hard to understand
why it was there in the first place. After 6f3caa6d815 it definitely
became necessary always and commit message from f1ee30ccd60 confirms that.
Now that 6f3caa6d815 is effectively backed out by 07285bb4c22, the call
appears to be useful only for sockets that landed on the incomplete queue,
e.g. sockets that have accept_filter(9) enabled on them.

Provide a new TCP flag to mark connections that are known to be on the
incomplete queue, and call soisconnected() only for those connections.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D36488


# 4012ef77 31-Aug-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Functional implementation of Accurate ECN

The AccECN handshake and TCP header flags are supported,
no support yet for the AccECN option. This minimalistic
implementation is sufficient to support DCTCP while
dramatically cutting the number of ACKs, and provide ECN
response from the receiver to the CC modules.

Reviewed By: #transport, #manpages, rrs, pauamma
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D21011


# e7d02be1 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: refactor protosw and domain static declaration and load

o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d18ef), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860dcd).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee54ea).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232


# 81a34d37 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: retire pr_drain and use EVENTHANDLER(9) directly

The method was called for two different conditions: 1) the VM layer is
low on pages or 2) one of UMA zones of mbuf allocator exhausted.
This change 2) into a new event handler, but all affected network
subsystems modified to subscribe to both, so this change shall not
bring functional changes under different low memory situations.

There were three subsystems still using pr_drain: TCP, SCTP and frag6.
The latter had its protosw entry for the only reason to register its
pr_drain method.

Reviewed by: tuexen, melifaro
Differential revision: https://reviews.freebsd.org/D36164


# 6c452841 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: use callout(9) directly instead of pr_slowtimo

Modern TCP stacks uses multiple callouts per tcpcb, and a global
callout is ancient artifact. However it is still used to garbage
collect compressed timewait entries.

Reviewed by: melifaro, tuexen
Differential revision: https://reviews.freebsd.org/D36159


# 74703901 04-Jul-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: use a TCP flag to check if connection has been close(2)d

The flag SS_NOFDREF is a private flag of the socket layer. It also
is supposed to be read with SOCK_LOCK(), which we don't own here.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D35663


# 700a395c 13-Apr-2022 John Baldwin <jhb@FreeBSD.org>

tcp_log_vain/addrs: Use a const pointer for the IPv4 header.

The pointer to the IPv6 header was already const.


# ea9017fb 21-Feb-2022 Randall Stewart <rrs@FreeBSD.org>

tcp: Congestion control move to using reference counting.

In the transport call on 12/3 Gleb asked to move the CC modules towards
using reference counting to prevent folks from unloading a module in use.
It was also agreed that Michael would do a user space utility like tcp_drop
that could be used to move all connections that are using a specific CC
to some other CC.

This is the half I committed to doing, making it so that we maintain a refcount
on a cc module every time a pcb refers to it and decrementing that every
time a pcb no longer uses a cc module. This also helps us simplify the
whole unloading process by getting rid of tcp_ccunload() which munged
through all the tcb's. Instead we mark a module as being removed and
prevent further references to it. We also make sure that if a module is
marked as being removed it cannot be made as the default and also
the opposite of that, if its a default it fails and does not mark it as being
removed.

Reviewed by: Michael Tuexen, Gleb Smirnoff
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D33249


# 0c2832ee 15-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Restore 6 tcps padding entries in HEAD

The padding in CURRENT shall not shrink. It is
designed that in CURRENT at always stays
the same, and then when a new stable is branched, it
inherits 6 pointer placeholders that can be used
withing this stable/X lifetime to extend the structure.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34269


# 3f169c54 09-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Add/update AccECN related statistics and numbers

Reserve couters in the tcps struct in preparation
for AccECN, extend the debugging output for TF2
flags, optimize the syncache flags from individual
bits to a codepoint for the specifc ECN handshake.

This is in preparation of AccECN.

No functional chance except for extended debug
output capabilities.

Reviewed By: #transport, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34161


# fd7daa72 08-Feb-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: make tcp_ctloutput_set() non-static

tcp_ctloutput_set() will be used via the sysctl interface in a
upcoming command line tool tcpsso.

Reviewed by: glebius, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D34164


# 1ebf4607 03-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Access all 12 TCP header flags via inline function

In order to consistently provide access to all
(including reserved) TCP header flag bits,
use an accessor function tcp_get_flags and
tcp_set_flags. Also expand any flag variable from
uint8_t / char to uint16_t.

Reviewed By: hselasky, tuexen, glebius, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34130


# 3b3c08c1 02-Feb-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: cleanup functions related to socket option handling

Consistently only pass the inp and the sopt around. Don't pass the
so around, since in a upcoming commit tcp_ctloutput_set() will be
called from a context different from setsockopt(). Also expect
the inp to be locked when calling tcp_ctloutput_[gs]et(), this is
also required for the upcoming use by tcpsso, a command line tool
to set socket options.
Reviewed by: glebius, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D34151


# 68e623c3 27-Jan-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Rewind erraneous RTO only while performing RTO retransmissions

Under rare circumstances, a spurious retranmission is
incorrectly detected and rewound, messing up various tcpcb values,
which can lead to a panic when SACK is in use.

Reviewed By: tuexen, chengc_netapp.com, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D33979


# 89128ff3 03-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protocols: init with standard SYSINIT(9) or VNET_SYSINIT

The historical BSD network stack loop that rolls over domains and
over protocols has no advantages over more modern SYSINIT(9).
While doing the sweep, split global and per-VNET initializers.

Getting rid of pr_init allows to achieve several things:
o Get rid of ifdef's that protect against double foo_init() when
both INET and INET6 are compiled in.
o Isolate initializers statically to the module they init.
o Makes code easier to understand and maintain.

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D33537


# f64dc2ab 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: TCP output method can request tcp_drop

The advanced TCP stacks (bbr, rack) may decide to drop a TCP connection
when they do output on it. The default stack never does this, thus
existing framework expects tcp_output() always to return locked and
valid tcpcb.

Provide KPI extension to satisfy demands of advanced stacks. If the
output method returns negative error code, it means that caller must
call tcp_drop().

In tcp_var() provide three inline methods to call tcp_output():
- tcp_output() is a drop-in replacement for the default stack, so that
default stack can continue using it internally without modifications.
For advanced stacks it would perform tcp_drop() and unlock and report
that with negative error code.
- tcp_output_unlock() handles the negative code and always converts
it to positive and always unlocks.
- tcp_output_nodrop() just calls the method and leaves the responsibility
to drop on the caller.

Sweep over the advanced stacks and use new KPI instead of using HPTS
delayed drop queue for that.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33370


# 5b08b46a 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: welcome back tcp_output() as the right way to run output on tcpcb.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33365


# 2a28b045 20-Dec-2021 Robert Wing <rew@FreeBSD.org>

tcp_twrespond: send signed segment when connection is TCP-MD5

When a connection is established to use TCP-MD5, tcp_twrespond() doesn't
respond with a signed segment. This results in the host performing the
active close to remain in a TIME_WAIT state and the other host in the
LAST_ACK state. Fix this by sending a signed segment when the connection
is established to use TCP-MD5.

Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D33490


# 71d2d5ad 19-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcptw: count how many times a tcptw was actually useful

This will allow a sysadmin to lower net.inet.tcp.msl and
see how long tcptw are actually useful.


# cb377263 19-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcptw: remove unused fields

The structure goes away anyway, but it would be interesting to know
how much memory we used to save with it. So for the record, structure
size with this revision is 64 bytes.


# db0ac6de 02-Dec-2021 Cy Schubert <cy@FreeBSD.org>

Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816"

This reverts commit 266f97b5e9a7958e365e78288616a459b40d924a, reversing
changes made to a10253cffea84c0c980a36ba6776b00ed96c3e3b.

A mismerge of a merge to catch up to main resulted in files being
committed which should not have been.


# de2d4784 02-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

SMR protection for inpcbs

With introduction of epoch(9) synchronization to network stack the
inpcb database became protected by the network epoch together with
static network data (interfaces, addresses, etc). However, inpcb
aren't static in nature, they are created and destroyed all the
time, which creates some traffic on the epoch(9) garbage collector.

Fairly new feature of uma(9) - Safe Memory Reclamation allows to
safely free memory in page-sized batches, with virtually zero
overhead compared to uma_zfree(). However, unlike epoch(9), it
puts stricter requirement on the access to the protected memory,
needing the critical(9) section to access it. Details:

- The database is already build on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database SMR section is now required.
Once the desired inpcb is found we need to transition from SMR
section to r/w lock on the inpcb itself, with a check that inpcb
isn't yet freed. This requires some compexity, since SMR section
itself is a critical(9) section. The complexity is hidden from
KPI users in inp_smr_lock().
- For a inpcb list traversal (a pcblist sysctl, or broadcast
notification) also a new KPI is provided, that hides internals of
the database - inp_next(struct inp_iterator *).

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33022


# ff945008 18-Nov-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Add tcp_freecb() - single place to free tcpcb.

Until this change there were two places where we would free tcpcb -
tcp_discardcb() in case if all timers are drained and tcp_timer_discard()
otherwise. They were pretty much copy-n-paste, except that in the
default case we would run tcp_hc_update(). Merge this into single
function tcp_freecb() and move new short version of tcp_timer_discard()
to tcp_timer.c and make it static.

Reviewed by: rrs, hselasky
Differential revision: https://reviews.freebsd.org/D32965


# f581a26e 25-Oct-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP.

Pass control for IP/IP6 level options from generic tcp_ctloutput_set()
down to per-stack ctloutput.

Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput().

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D32655


# e2833083 25-Oct-2021 Peter Lei <peterlei@netflix.com>

tcp: socket option to get stack alias name

TCP stack sysctl nodes are currently inserted using the stack
name alias. Allow the user to get the current stack's alias to
allow for programatic sysctl access.

Obtained from: Netflix


# a36230f7 01-Oct-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Make dsack stats available in netstat and also make sure its aware of TLP's.

DSACK accounting has been for quite some time under a NETFLIX_STATS ifdef. Statistics
on DSACKs however are very useful in figuring out how much bad retransmissions you
are doing. This is further complicated, however, by stacks that do TLP. A TLP
when discovering a lost ack in the reverse path will cause the generation
of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well
as the more traditional dsack-bytes and dsack-packets. These will now
all display in netstat -p tcp -s. This also updates all stacks that
are currently built to keep track of these stats.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32158


# e3e7d953 15-Sep-2021 Hans Petter Selasky <hselasky@FreeBSD.org>

tcp: Avoid division by zero when KERN_TLS is enabled in tcp_account_for_send().

If the "len" variable is non-zero, we can assume that the sum of
"tp->t_snd_rxt_bytes + tp->t_sndbytes" is also non-zero.

It is also assumed that the 64-bit byte counters will never wrap around.

Differential Revision: https://reviews.freebsd.org/D31959
Reviewed by: gallatin, rrs and tuexen
Found by: "I told you so", also called hselasky
MFC after: 1 week
Sponsored by: NVIDIA Networking


# 739de953 05-Aug-2021 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: Move KERN_TLS ifdef to tcp_var.h

This allows us to remove stubs in ktls.h and allows us
to sort the function prototypes.

Reviewed by: jhb
Sponsored by: Netflix


# ca1a7e10 12-Jul-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: TCP_LRO getting bad checksums and sending it in to TCP incorrectly.

In reviewing tcp_lro.c we have a possibility that some drives may send a mbuf into
LRO without making sure that the checksum passes. Some drivers actually are
aware of this and do not call lro when the csum failed, others do not do this and
thus could end up sending data up that we think has a checksum passing when
it does not.

This change will fix that situation by properly verifying that the mbuf
has the correct markings (CSUM VALID bits as well as csum in mbuf header
is set to 0xffff).

Reviewed by: tuexen, hselasky, gallatin
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31155


# 28d0a740 06-Jul-2021 Andrew Gallatin <gallatin@FreeBSD.org>

ktls: auto-disable ifnet (inline hw) kTLS

Ifnet (inline) hw kTLS NICs typically keep state within
a TLS record, so that when transmitting in-order,
they can continue encryption on each segment sent without
DMA'ing extra state from the host.

This breaks down when transmits are out of order (eg,
TCP retransmits). In this case, the NIC must re-DMA
the entire TLS record up to and including the segment
being retransmitted. This means that when re-transmitting
the last 1448 byte segment of a TLS record, the NIC will
have to re-DMA the entire 16KB TLS record. This can lead
to the NIC running out of PCIe bus bandwidth well before
it saturates the network link if a lot of TCP connections have
a high retransmoit rate.

This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct),
where TCP connections with higher retransmit rate will be
switched to SW kTLS so as to conserve PCIe bandwidth.

Reviewed by: hselasky, markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30908


# 9e4d9e4c 25-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Preparation for allowing hardware TLS to be able to kick a tcp connection that is retransmitting too much out of hardware and back to software.

Hardware TLS is now supported in some interface cards and it works well. Except that
when we have connections that retransmit a lot we get into trouble with all the retransmits.
This prep step makes way for change that Drew will be making so that we can "kick out" a
session from hardware TLS.

Reviewed by: mtuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30895


# 67e89281 10-Jun-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Mbuf leak while holding a socket buffer lock.

When running at NF the current Rack and BBR changes with the recent
commits from Richard that cause the socket buffer lock to be held over
the ip_output() call and then finally culminating in a call to tcp_handle_wakeup()
we get a lot of leaked mbufs. I don't think that this leak is actually caused
by holding the lock or what Richard has done, but is exposing some other
bug that has probably been lying dormant for a long time. I will continue to
look (using his changes) at what is going on to try to root cause out the issue.

In the meantime I can't leave the leaks out for everyone else. So this commit
will revert all of Richards changes and move both Rack and BBR back to just
doing the old sorwakeup_locked() calls after messing with the so_rcv buffer.

We may want to look at adding back in Richards changes after I have pinpointed
the root cause of the mbuf leak and fixed it.

Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30704


# 032bf749 21-May-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

[tcp] Keep socket buffer locked until upcall

r367492 would unlock the socket buffer before eventually calling the upcall.
This leads to problematic interaction with NFS kernel server/client components
(MP threads) accessing the socket buffer with potentially not correctly updated
state.

Reported by: rmacklem
Reviewed By: tuexen, #transport
Tested by: rmacklem, otis
MFC after: 2 weeks
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29690


# 0471a8c7 10-May-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: SACK Lost Retransmission Detection (LRD)

Recover from excessive losses without reverting to a
retransmission timeout (RTO). Disabled by default, enable
with sysctl net.inet.tcp.do_lrd=1

Reviewed By: #transport, rrs, tuexen, #manpages
Sponsored by: Netapp, Inc.
Differential Revision: https://reviews.freebsd.org/D28931


# 5d8fd932 06-May-2021 Randall Stewart <rrs@FreeBSD.org>

This brings into sync FreeBSD with the netflix versions of rack and bbr.
This fixes several breakages (panics) since the tcp_lro code was
committed that have been reported. Quite a few new features are
now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the
largest). There is also support for ack-war prevention. Documents
comming soon on rack..

Sponsored by: Netflix
Reviewed by: rscheff, mtuexen
Differential Revision: https://reviews.freebsd.org/D30036


# 9ca874cf 30-Mar-2021 Hans Petter Selasky <hselasky@FreeBSD.org>

Add TCP LRO support for VLAN and VxLAN.

This change makes the TCP LRO code more generic and flexible with regards
to supporting multiple different TCP encapsulation protocols and in general
lays the ground for broader TCP LRO support. The main job of the TCP LRO code is
to merge TCP packets for the same flow, to reduce the number of calls to upper
layers. This reduces CPU and increases performance, due to being able to send
larger TSO offloaded data chunks at a time. Basically the TCP LRO makes it
possible to avoid per-packet interaction by the host CPU.

Because the current TCP LRO code was tightly bound and optimized for TCP/IP
over ethernet only, several larger changes were needed. Also a minor bug was
fixed in the flushing mechanism for inactive entries, where the expire time,
"le->mtime" was not always properly set.

To avoid having to re-run time consuming regression tests for every change,
it was chosen to squash the following list of changes into a single commit:
- Refactor parsing of all address information into the "lro_parser" structure.
This easily allows to reuse parsing code for inner headers.
- Speedup header data comparison. Don't compare field by field, but
instead use an unsigned long array, where the fields get packed.
- Refactor the IPv4/TCP/UDP checksum computations, so that they may be computed
recursivly, only applying deltas as the result of updating payload data.
- Make smaller inline functions doing one operation at a time instead of
big functions having repeated code.
- Refactor the TCP ACK compression code to only execute once
per TCP LRO flush. This gives a minor performance improvement and
keeps the code simple.
- Use sbintime() for all time-keeping. This change also fixes flushing
of inactive entries.
- Try to shrink the size of the LRO entry, because it is frequently zeroed.
- Removed unused TCP LRO macros.
- Cleanup unused TCP LRO statistics counters while at it.
- Try to use __predict_true() and predict_false() to optimise CPU branch
predictions.

Bump the __FreeBSD_version due to changing the "lro_ctrl" structure.

Tested by: Netflix
Reviewed by: rrs (transport)
Differential Revision: https://reviews.freebsd.org/D29564
MFC after: 2 week
Sponsored by: Mellanox Technologies // NVIDIA Networking


# 9e644c23 18-Apr-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add support for TCP over UDP

Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.

Reviewed by: rrs
Sponsored by: Netflix, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29469


# d1de2b05 17-Apr-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Rename rfc6675_pipe to sack.revised, and enable by default

As full support of RFC6675 is in place, deprecating
net.inet.tcp.rfc6675_pipe and enabling by default
net.inet.tcp.sack.revised.

Reviewed By: #transport, kbowling, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28702


# 90cca08e 08-Apr-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Prepare PRR to work with NewReno LossRecovery

Add proper PRR vnet declarations for consistency.
Also add pointer to tcpopt struct to tcp_do_prr_ack, in preparation
for it to deal with non-SACK window reduction (after loss).

No functional change.

MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29440


# eb3a59a8 25-Mar-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Refactor PRR code

No functional change intended.

MFC after: 2 weeks
Reviewed By: #transport, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29411


# e5313869 05-Mar-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Add prr_out in preparation for PRR/nonSACK and LRD

Reviewed By: #transport, kbowling
MFC after: 3 days
Sponsored By: Netapp, Inc.
Differential Revision: https://reviews.freebsd.org/D29058


# 69a34e8d 26-Jan-2021 Randall Stewart <rrs@FreeBSD.org>

Update the LRO processing code so that we can support
a further CPU enhancements for compressed acks. These
are acks that are compressed into an mbuf. The transport
has to be aware of how to process these, and an upcoming
update to rack will do so. You need the rack changes
to actually test and validate these since if the transport
does not support mbuf compression, then the old code paths
stay in place. We do in this commit take out the concept
of logging if you don't have a lock (which was quite
dangerous and was only for some early debugging but has
been left in the code).

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D28374


# d2b3cedd 13-Jan-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add sysctl to tolerate TCP segments missing timestamps

When timestamp support has been negotiated, TCP segements received
without a timestamp should be discarded. However, there are broken
TCP implementations (for example, stacks used by Omniswitch 63xx and
64xx models), which send TCP segments without timestamps although
they negotiated timestamp support.
This patch adds a sysctl variable which tolerates such TCP segments
and allows to interoperate with broken stacks.

Reviewed by: jtl@, rscheff@
Differential Revision: https://reviews.freebsd.org/D28142
Sponsored by: Netflix, Inc.
PR: 252449
MFC after: 1 week


# 0e1d7c25 04-Dec-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Add TCP feature Proportional Rate Reduction (PRR) - RFC6937

PRR improves loss recovery and avoids RTOs in a wide range
of scenarios (ACK thinning) over regular SACK loss recovery.

PRR is disabled by default, enable by net.inet.tcp.do_prr = 1.
Performance may be impeded by token bucket rate policers at
the bottleneck, where net.inet.tcp.do_prr_conservate = 1
should be enabled in addition.

Submitted by: Aris Angelogiannopoulos
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D18892


# 4d0770f1 08-Nov-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Prevent premature SACK block transmission during loss recovery

Under specific conditions, a window update can be sent with
outdated SACK information. Some clients react to this by
subsequently delaying loss recovery, making TCP perform very
poorly.

Reported by: chengc_netapp.com
Reviewed by: rrs, jtl
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D24237


# ce398115 28-Oct-2020 John Baldwin <jhb@FreeBSD.org>

Save the current TCP pacing rate in t_pacing_rate.

Reviewed by: gallatin, gnn
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26875


# 54321200 09-Oct-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Extend netstat to display TCP stack and detailed congestion state (2)

Extend netstat to display TCP stack and detailed congestion state

Adding the "-c" option used to show detailed per-connection
congestion control state for TCP sessions.

This is one summary patch, which adds the relevant variables into
xtcpcb. As previous "spare" space is used, these changes are ABI
compatible.

Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26518


# 42d75607 13-Sep-2020 Michael Tuexen <tuexen@FreeBSD.org>

Export the name of the congestion control. This will be used by sockstat
and netstat.

Reviewed by: rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D26412


# f359d6eb 13-Aug-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Improve SACK support code for RFC6675 and PRR

Adding proper accounting of sacked_bytes and (per-ACK)
delivered data to the SACK scoreboard. This will
allow more aspects of RFC6675 to be implemented as well
as Proportional Rate Reduction (RFC6937).

Prior to this change, the pipe calculation controlled with
net.inet.tcp.rfc6675_pipe was also susceptible to incorrect
results when more than 3 (or 4) holes in the sequence space
were present, which can no longer all fit into a single
ACK's SACK option.

Reviewed by: kbowling, rgrimes (mentor)
Approved by: rgrimes (mentor, blanket)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D18624


# 9dc7d8a2 24-Jun-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

TCP: make after-idle work for transactional sessions.

The use of t_rcvtime as proxy for the last transmission
fails for transactional IO, where the client requests
data before the server can respond with a bulk transfer.

Set aside a dedicated variable to actually track the last
locally sent segment going forward.

Reported by: rrs
Reviewed by: rrs, tuexen (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D25016


# e854dd38 08-Jun-2020 Randall Stewart <rrs@FreeBSD.org>

An important statistic in determining if a server process (or client) is being delayed
is to know the time to first byte in and time to first byte out. Currently we
have no way to know these all we have is t_starttime. That (t_starttime) tells us
what time the 3 way handshake completed. We don't know when the first
request came in or how quickly we responded. Nor from a client perspective
do we know how long from when we sent out the first byte before the
server responded.

This small change adds the ability to track the TTFB's. This will show up in
BB logging which then can be pulled for later analysis. Note that currently
the tracking is via the ticks variable of all three variables. This provides
a very rough estimate (hz=1000 its 1ms). A follow-on set of work will be
to change all three of these values into something with a much finer resolution
(either microseconds or nanoseconds), though we may want to make the resolution
configurable so that on lower powered machines we could still use the much
cheaper ticks variable.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D24902


# d3b6c96b 04-May-2020 Randall Stewart <rrs@FreeBSD.org>

Adjust the fb to have a way to ask the underlying stack
if it can support the PRUS option (OOB). And then have
the new function call that to validate and give the
correct error response if needed to the user (rack
and bbr do not support obsoleted OOB data).

Sponsoered by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D24574


# e570d231 27-Apr-2020 Randall Stewart <rrs@FreeBSD.org>

This change does a small prepratory step in getting the
latest rack and bbr in from the NF repo. When those come
in the OOB data handling will be fixed where Skyzaller crashes.

Differential Revision: https://reviews.freebsd.org/D24575


# b89af8e1 14-Apr-2020 Michael Tuexen <tuexen@FreeBSD.org>

Improve the TCP blackhole detection. The principle is to reduce the
MSS in two steps and try each candidate two times. However, if two
candidates are the same (which is the case in TCP/IPv6), this candidate
was tested four times. This patch ensures that each candidate actually
reduced the MSS and is only tested 2 times. This reduces the time window
of missclassifying a temporary outage as an MTU issue.

Reviewed by: jtl
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24308


# 7ca6e296 12-Mar-2020 Michael Tuexen <tuexen@FreeBSD.org>

Use KMOD_TCPSTAT_INC instead of TCPSTAT_INC for RACK and BBR, since
these are kernel modules. Also add a KMOD_TCPSTAT_ADD and use that
instead of TCPSTAT_ADD.

Reviewed by: jtl@, rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D23904


# a3574665 13-Feb-2020 Michael Tuexen <tuexen@FreeBSD.org>

sack_newdata and snd_recover hold the same value. Therefore, use only
a single instance: use snd_recover also where sack_newdata was used.

Submitted by: Richard Scheffenegger
Differential Revision: https://reviews.freebsd.org/D18811


# 481be5de 12-Feb-2020 Randall Stewart <rrs@FreeBSD.org>

White space cleanup -- remove trailing tab's or spaces
from any line.

Sponsored by: Netflix Inc.


# 334fc582 08-Jan-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

vnet: virtualise more network stack sysctls.

Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are
set in the netoptions startup script, which we would love to run for VNETs
as well [1].

While virtualising the log_in_vain sysctls seems pointles at first for as
long as the kernel message buffer is not virtualised, it at least allows
an administrator to debug the base system or an individual jail if needed
without turning the logging on for all jails running on a system.

PR: 243193 [1]
MFC after: 2 weeks


# 4ad24737 06-Jan-2020 Randall Stewart <rrs@FreeBSD.org>

This catches rack up in the recent changes to ECN and
also commonizes the functions that both the freebsd and
rack stack uses.

Sponsored by:Netflix Inc
Differential Revision: https://reviews.freebsd.org/D23052


# a9a08ece 05-Jan-2020 Randall Stewart <rrs@FreeBSD.org>

This change adds a small feature to the tcp logging code. Basically
a connection can now have a separate tag added to the id.

Obtained from: Lawrence Stewart
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D22866


# 1cf55767 17-Dec-2019 Randall Stewart <rrs@FreeBSD.org>

This commit is a bit of a re-arrange of deck chairs. It
gets both rack and bbr ready for the completion of the STATs
framework in FreeBSD. For now if you don't have both NF_stats and
stats on it disables them. As soon as the rest of the stats framework
lands we can remove that restriction and then just uses stats when
defined.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D22479


# adc56f5a 02-Dec-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Make use of the stats(3) framework in the TCP stack.

This makes it possible to retrieve per-connection statistical
information such as the receive window size, RTT, or goodput,
using a newly added TCP_STATS getsockopt(3) option, and extract
them using the stats_voistat_fetch(3) API.

See the net/tcprtt port for an example consumer of this API.

Compared to the existing TCP_INFO system, the main differences
are that this mechanism is easy to extend without breaking ABI,
and provides statistical information instead of raw "snapshots"
of values at a given point in time. stats(3) is more generic
and can be used in both userland and the kernel.

Reviewed by: thj
Tested by: thj
Obtained from: Netflix
Relnotes: yes
Sponsored by: Klara Inc, Netflix
Differential Revision: https://reviews.freebsd.org/D20655


# 3cf38784 01-Dec-2019 Michael Tuexen <tuexen@FreeBSD.org>

Move all ECN related flags from the flags to the flags2 field.
This allows adding more ECN related flags in the future.
No functional change intended.

Submitted by: Richard Scheffenegger
Reviewed by: rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22497


# 77aabfd9 01-Dec-2019 Michael Tuexen <tuexen@FreeBSD.org>

Make the TF_* flags easier readable by humans by adding leading zeroes
to make them aligned.

Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22428


# b72e56e7 01-Dec-2019 Michael Tuexen <tuexen@FreeBSD.org>

This is an initial step in implementing the new congestion window
validation as specified in RFC 7661.

Submitted by: Richard Scheffenegger
Reviewed by: rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D21798


# af9b9e0d 06-Sep-2019 Randall Stewart <rrs@FreeBSD.org>

This adds in the missing counter initialization which
I had forgotten to bring over.. opps.

Differential Revision: https://reviews.freebsd.org/D21127


# fe5dee73 02-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

This patch improves the DSACK handling to conform with RFC 2883.
The lowest SACK block is used when multiple Blocks would be elegible as
DSACK blocks ACK blocks get reordered - while maintaining the ordering of
SACK blocks not relevant in the DSACK context is maintained.

Reviewed by: rrs@, tuexen@
Obtained from: Richard Scheffenegger
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21038


# b2e60773 26-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277


# d21036e0 23-Jul-2019 Michael Tuexen <tuexen@FreeBSD.org>

Add a sysctl variable ts_offset_per_conn to change the computation
of the TCP TS offset from taking the IP addresses and the TCP port
numbers into account to a version just taking only the IP addresses
into account. This works around broken middleboxes or endpoints.
The default is to keep the behaviour, which is also the behaviour
recommended in RFC 7323.

Reported by: devgs@ukr.net
Reviewed by: rrs@
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D20980


# e5926fd3 14-Jul-2019 Randall Stewart <rrs@FreeBSD.org>

This is the second in a number of patches needed to
get BBRv1 into the tree. This fixes the DSACK bug but
is also needed by BBR. We have yet to go two more
one will be for the pacing code (tcp_ratelimit.c) and
the second will be for the new updated LRO code that
allows a transport to know the arrival times of packets
and (tcp_lro.c). After that we should finally be able
to get BBRv1 into head.

Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D20908


# 3b0b41e6 10-Jul-2019 Randall Stewart <rrs@FreeBSD.org>

This commit updates rack to what is basically being used at NF as
well as sets in some of the groundwork for committing BBR. The
hpts system is updated as well as some other needed utilities
for the entrance of BBR. This is actually part 1 of 3 more
needed commits which will finally complete with BBRv1 being
added as a new tcp stack.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D20834


# 7893235f 29-Mar-2019 Navdeep Parhar <np@FreeBSD.org>

tcp_autorcvbuf_inc was removed in r344433.

Discussed with: tuexen@
Sponsored by: Chelsio Communications


# 7dc90a1d 25-Jan-2019 Michael Tuexen <tuexen@FreeBSD.org>

Fix a bug in the restart window computation of TCP New Reno

When implementing support for IW10, an update in the computation
of the restart window used after an idle phase was missed. To
minimize code duplication, implement the logic in tcp_compute_initwnd()
and call it. This fixes a bug in NewReno, which was not aware of
IW10.

Submitted by: Richard Scheffenegger
Reviewed by: tuexen@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18940


# c28440db 19-Aug-2018 Randall Stewart <rrs@FreeBSD.org>

This change represents a substantial restructure of the way we
reassembly inbound tcp segments. The old algorithm just blindly
dropped in segments without coalescing. This meant that every
segment could take up greater and greater room on the linked list
of segments. This of course is now subject to a tighter limit (100)
of segments which in a high BDP situation will cause us to be a
lot more in-efficent as we drop segments beyond 100 entries that
we receive. What this restructure does is cause the reassembly
buffer to coalesce segments putting an emphasis on the two
common cases (which avoid walking the list of segments) i.e.
where we add to the back of the queue of segments and where we
add to the front. We also have the reassembly buffer supporting
a couple of debug options (black box logging as well as counters
for code coverage). These are compiled out by default but can
be added by uncommenting the defines.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D16626


# 8e02b4e0 19-Aug-2018 Michael Tuexen <tuexen@FreeBSD.org>

Don't expose the uptime via the TCP timestamps.

The TCP client side or the TCP server side when not using SYN-cookies
used the uptime as the TCP timestamp value. This patch uses in all
cases an offset, which is the result of a keyed hash function taking
the source and destination addresses and port numbers into account.
The keyed hash function is the same a used for the initial TSN.

Reviewed by: rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16636


# f38b68ae 05-Jul-2018 Brooks Davis <brooks@FreeBSD.org>

Make struct xinpcb and friends word-size independent.

Replace size_t members with ksize_t (uint64_t) and pointer members
(never used as pointers in userspace, but instead as unique
idenitifiers) with kvaddr_t (uint64_t). This makes the structs
identical between 32-bit and 64-bit ABIs.

On 64-bit bit systems, the ABI is maintained. On 32-bit systems,
this is an ABI breaking change. The ABI of most of these structs
was previously broken in r315662. This also imposes a small API
change on userspace consumers who must handle kernel pointers
becoming virtual addresses.

PR: 228301 (exp-run by antoine)
Reviewed by: jtl, kib, rwatson (various versions)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15386


# 6573d758 03-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

epoch(9): allow preemptible epochs to compose

- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066


# 89e560f4 07-Jun-2018 Randall Stewart <rrs@FreeBSD.org>

This commit brings in a new refactored TCP stack called Rack.
Rack includes the following features:
- A different SACK processing scheme (the old sack structures are not used).
- RACK (Recent acknowledgment) where counting dup-acks is no longer done
instead time is used to knwo when to retransmit. (see the I-D)
- TLP (Tail Loss Probe) where we will probe for tail-losses to attempt
to try not to take a retransmit time-out. (see the I-D)
- Burst mitigation using TCPHTPS
- PRR (partial rate reduction) see the RFC.

Once built into your kernel, you can select this stack by either
socket option with the name of the stack is "rack" or by setting
the global sysctl so the default is rack.

Note that any connection that does not support SACK will be kicked
back to the "default" base FreeBSD stack (currently known as "default").

To build this into your kernel you will need to enable in your
kernel:
makeoptions WITH_EXTRA_TCP_STACKS=1
options TCPHPTS

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15525


# 803a2305 26-Apr-2018 Randall Stewart <rrs@FreeBSD.org>

This change re-arranges the fields within the tcp-pcb so that
they are more in order of cache line use as one passes
through the tcp_input/output paths (non-errors most likely path). This
helps speed up cache line optimization so that the tcp stack runs
a bit more efficently.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15136


# 3ee9c3c4 19-Apr-2018 Randall Stewart <rrs@FreeBSD.org>

This commit brings in the TCP high precision timer system (tcp_hpts).
It is the forerunner/foundational work of bringing in both Rack and BBR
which use hpts for pacing out packets. The feature is optional and requires
the TCPHPTS option to be enabled before the feature will be active. TCP
modules that use it must assure that the base component is compile in
the kernel in which they are loaded.

MFC after: Never
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15020


# f8979519 10-Apr-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Add missing header change from r332381.

Sponsored by: Netflix, Inc.
Pointy hat: jtl


# 2529f56e 22-Mar-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Add the "TCP Blackbox Recorder" which we discussed at the developer
summits at BSDCan and BSDCam in 2017.

The TCP Blackbox Recorder allows you to capture events on a TCP connection
in a ring buffer. It stores metadata with the event. It optionally stores
the TCP header associated with an event (if the event is associated with a
packet) and also optionally stores information on the sockets.

It supports setting a log ID on a TCP connection and using this to correlate
multiple connections that share a common log ID.

You can log connections in different modes. If you are doing a coordinated
test with a particular connection, you may tell the system to put it in
mode 4 (continuous dump). Or, if you just want to monitor for errors, you
can put it in mode 1 (ring buffer) and dump all the ring buffers associated
with the connection ID when we receive an error signal for that connection
ID. You can set a default mode that will be applied to a particular ratio
of incoming connections. You can also manually set a mode using a socket
option.

This commit includes only basic probes. rrs@ has added quite an abundance
of probes in his TCP development work. He plans to commit those soon.

There are user-space programs which we plan to commit as ports. These read
the data from the log device and output pcapng files, and then let you
analyze the data (and metadata) in the pcapng files.

Reviewed by: gnn (previous version)
Obtained from: Netflix, Inc.
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D11085


# 18a75309 25-Feb-2018 Patrick Kelsey <pkelsey@FreeBSD.org>

Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option.

The conditional compilation support is now centralized in
tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum
theoretical code/data footprint when TCP_RFC7413 is disabled, but
nearly all the TFO code should wind up being removed by the optimizer,
the additional footprint in the syncache entries is a single pointer,
and the additional overhead in the tcpcb is at the end of the
structure.

This enables the TCP_RFC7413 kernel option by default in amd64 and
arm64 GENERIC.

Reviewed by: hiren
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14048


# c560df6f 25-Feb-2018 Patrick Kelsey <pkelsey@FreeBSD.org>

This is an implementation of the client side of TCP Fast Open (TFO)
[RFC7413]. It also includes a pre-shared key mode of operation in
which the server requires the client to be in possession of a shared
secret in order to successfully open TFO connections with that server.

The names of some existing fastopen sysctls have changed (e.g.,
net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable).

Reviewed by: tuexen
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14047


# 66492fea 07-Dec-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Separate out send buffer autoscaling code into function, so that
alternative TCP stacks may reuse it instead of pasting.

Obtained from: Netflix


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 3bdf4c42 11-Oct-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Declare more TCP globals in tcp_var.h, so that alternative TCP stacks
can use them. Gather all TCP tunables in tcp_var.h in one place and
alphabetically sort them, to ease maintainance of the list.

Don't copy and paste declarations in tcp_stacks/fastpath.c.


# e5cccc35 12-Sep-2017 Michael Tuexen <tuexen@FreeBSD.org>

Add support to print the TCP stack being used.

Sponsored by: Netflix, Inc.


# 32a04bb8 25-Aug-2017 Sean Bruno <sbruno@FreeBSD.org>

Use counter(9) for PLPMTUD counters.

Remove unused PLPMTUD sysctl counters.

Bump UPDATING and FreeBSD Version to indicate a rebuild is required.

Submitted by: kevin.bowling@kev009.com
Reviewed by: jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12003


# dc6a41b9 08-Jun-2017 Jonathan T. Looney <jtl@FreeBSD.org>

Add the infrastructure to support loading multiple versions of TCP
stack modules.

It adds support for mangling symbols exported by a module by prepending
a string to them. (This avoids overlapping symbols in the kernel linker.)

It allows the use of a macro as the module name in the DECLARE_MACRO()
and MACRO_VERSION() macros.

It allows the code to register stack aliases (e.g. both a generic name
["default"] and version-specific name ["default_10_3p1"]).

With these changes, it is trivial to compile TCP stack modules with
the name defined in the Makefile and to load multiple versions of the
same stack simultaneously. This functionality can be used to enable
side-by-side testing of an old and new version of the same TCP stack.
It also could support upgrading the TCP stack without a reboot.

Reviewed by: gnn, sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D11086


# e44c1887 10-Apr-2017 Steven Hartland <smh@FreeBSD.org>

Use estimated RTT for receive buffer auto resizing instead of timestamps

Switched from using timestamps to RTT estimates when performing TCP receive
buffer auto resizing, as not all hosts support / enable TCP timestamps.

Disabled reset of receive buffer auto scaling when not in bulk receive mode,
which gives an extra 20% performance increase.

Also extracted auto resizing to a common method shared between standard and
fastpath modules.

With this AWS S3 downloads at ~17ms latency on a 1Gbps connection jump from
~3MB/s to ~100MB/s using the default settings.

Reviewed by: lstewart, gnn
MFC after: 2 weeks
Relnotes: Yes
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D9668


# cc65eb4e 21-Mar-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Hide struct inpcb, struct tcpcb from the userland.

This is a painful change, but it is needed. On the one hand, we avoid
modifying them, and this slows down some ideas, on the other hand we still
eventually modify them and tools like netstat(1) never work on next version of
FreeBSD. We maintain a ton of spares in them, and we already got some ifdef
hell at the end of tcpcb.

Details:
- Hide struct inpcb, struct tcpcb under _KERNEL || _WANT_FOO.
- Make struct xinpcb, struct xtcpcb pure API structures, not including
kernel structures inpcb and tcpcb inside. Export into these structures
the fields from inpcb and tcpcb that are known to be used, and put there
a ton of spare space.
- Make kernel and userland utilities compilable after these changes.
- Bump __FreeBSD_version.

Reviewed by: rrs, gnn
Differential Revision: D10018


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# cfff3743 10-Feb-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Move tcp_fields_to_net() static inline into tcp_var.h, just below its
friend tcp_fields_to_host(). There is third party code that also uses
this inline.

Reviewed by: ae


# fcf59617 06-Feb-2017 Andrey V. Elsukov <ae@FreeBSD.org>

Merge projects/ipsec into head/.

Small summary
-------------

o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
option IPSEC_SUPPORT added. It enables support for loading
and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
support was removed. Added TCP/UDP checksum handling for
inbound packets that were decapsulated by transport mode SAs.
setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
build as part of ipsec.ko module (or with IPSEC kernel).
It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
methods. The only one header file <netipsec/ipsec_support.h>
should be included to declare all the needed things to work
with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
- now all security associations stored in the single SPI namespace,
and all SAs MUST have unique SPI.
- several hash tables added to speed up lookups in SADB.
- SADB now uses rmlock to protect access, and concurrent threads
can do SA lookups in the same time.
- many PF_KEY message handlers were reworked to reflect changes
in SADB.
- SADB_UPDATE message was extended to support new PF_KEY headers:
SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
avoid locking protection for ipsecrequest. Now we support
only limited number (4) of bundled SAs, but they are supported
for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
check for full history of applied IPsec transforms.
o References counting rules for security policies and security
associations were changed. The proper SA locking added into xform
code.
o xform code was also changed. Now it is possible to unregister xforms.
tdb_xxx structures were changed and renamed to reflect changes in
SADB/SPDB, and changed rules for locking and refcounting.

Reviewed by: gnn, wblock
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9352


# 5e946c03 12-Jan-2017 Maxim Sobolev <sobomax@FreeBSD.org>

Fix slight type mismatch between so_options defined in sys/socketvar.h
and tw_so_options defined here which is supposed to be a copy of the
former (short vs u_short respectively).

Switch tw_so_options to be "signed short" to match the type of the field
it's inherited from.


# 68bd7ed1 12-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

The TFO server-side code contains some changes that are not conditioned on
the TCP_RFC7413 kernel option. This change removes those few instructions
from the packet processing path.

While not strictly necessary, for the sake of consistency, I applied the
new IS_FASTOPEN macro to all places in the packet processing path that
used the (t_flags & TF_FASTOPEN) check.

Reviewed by: hiren
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8219


# bd79708d 11-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.

Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.

This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.

Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.

Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8185


# 3ac12506 06-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

Remove "long" variables from the TCP stack (not including the modular
congestion control framework).

Reviewed by: gnn, lstewart (partial)
Sponsored by: Juniper Networks, Netflix
Differential Revision: (multiple)
Tested by: Limelight, Netflix


# 55a429a6 06-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

Remove declaration of un-defined function tcp_seq_subtract().

Reviewed by: gnn
MFC after: 1 week
Sponsored by: Juniper Networks, Netflix
Differential Revision: https://reviews.freebsd.org/D7055


# 4b7b743c 25-Aug-2016 Lawrence Stewart <lstewart@FreeBSD.org>

Pass the number of segments coalesced by LRO up the stack by repurposing the
tso_segsz pkthdr field during RX processing, and use the information in TCP for
more correct accounting and as a congestion control input. This is only a start,
and an audit of other uses for the data is left as future work.

Reviewed by: gallatin, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D7564


# 587d67c0 16-Aug-2016 Randall Stewart <rrs@FreeBSD.org>

Here we update the modular tcp to be able to switch to an
alternate TCP stack in other then the closed state (pre-listen/connect).
The idea is that *if* that is supported by the alternate stack, it
is asked if its ok to switch. If it approves the "handoff" then we
allow the switch to happen. Also the fini() function now gets a flag
to tell if you are switching away *or* the tcb is destroyed. The
init() call into the alternate stack is moved to the end so the
tcb is more fully formed before the init transpires.

Sponsored by: Netflix Inc.
Differential Revision: D6790


# 3f58662d 01-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

The pr_destroy field does not allow us to run the teardown code in a
specific order. VNET_SYSUNINITs however are doing exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some order and to make the teardown functions
file local static.
Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use
internally.

Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.

Reviewed by: gnn, tuexen (sctp), jhb (kernel.h)
Obtained from: projects/vnet
MFC after: 2 weeks
X-MFC: do not remove pr_destroy
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6652


# f59d975e 17-May-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Tiny refactor of r294869/r296881: use defines to mask the VNET() macro.

Suggested by: bz


# 5105a92c 17-May-2016 Randall Stewart <rrs@FreeBSD.org>

This small change adopts the excellent suggestion for using named
structures in the add of a new tcp-stack that came in late to me
via email after the last commit. It also makes it so that a new
stack may optionally get a callback during a retransmit
timeout. This allows the new stack to clear specific state (think
sack scoreboards or other such structures).

Sponsored by: Netflix Inc.
Differential Revision: http://reviews.freebsd.org/D6303


# e5ad6456 28-Apr-2016 Randall Stewart <rrs@FreeBSD.org>

This cleans up the timers code in TCP to start using the new
async_drain functionality. This as been tested in NF as well as
by Verisign. Still to do in here is to remove all the old flags. They
are currently left being maintained but probably are no longer needed.

Sponsored by: Netflix Inc.
Differential Revision: http://reviews.freebsd.org/D5924


# 5d20f974 22-Mar-2016 Jonathan T. Looney <jtl@FreeBSD.org>

to_flags is currently a 64-bit integer; however, we only use 7 bits.
Furthermore, there is no reason this needs to be a 64-bit integer
for the forseeable future.

Also, there is an inconsistency between to_flags and the mask in
tcp_addoptions(). Before r195654, to_flags was a u_long and the mask in
tcp_addoptions() was a u_int. r195654 changed to_flags to be a u_int64_t
but left the mask in tcp_addoptions() as a u_int, meaning that these
variables will only be the same width on platforms with 64-bit integers.

Convert both to_flags and the mask in tcp_addoptions() to be explicitly
32-bit variables. This may save a few cycles on 32-bit platforms, and
avoids unnecessarily mixing types.

Differential Revision: https://reviews.freebsd.org/D5584
Reviewed by: hiren
MFC after: 2 weeks
Sponsored by: Juniper Networks


# bf840a17 14-Mar-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Redo r294869. The array of counters for TCP states doesn't belong to
struct tcpstat, because the structure can be zeroed out by netstat(1) -z,
and of course running connection counts shouldn't be touched.

Place running connection counts into separate array, and provide
separate read-only sysctl oid for it.


# 2f06d2ab 14-Mar-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Comment fix: statistics are not read-only.


# 57a78e3b 26-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Augment struct tcpstat with tcps_states[], which is used for book-keeping
the amount of TCP connections by state. Provides a cheap way to get
connection count without traversing the whole pcb list.

Sponsored by: Netflix


# d17d4c6b 26-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Provide TCPSTAT_DEC() and TCPSTAT_FETCH() macros.


# 1f12da0e 22-Jan-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Just checkpoint the WIP in order to be able to make the tree update
easier. Note: this is currently not in a usable state as certain
teardown parts are not called and the DOMAIN rework is missing.
More to come soon and find its way to head.

Obtained from: P4 //depot/user/bz/vimage/...
Sponsored by: The FreeBSD Foundation


# 0c39d38d 06-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Historically we have two fields in tcpcb to describe sender MSS: t_maxopd,
and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned
up after T/TCP removal. After all permutations over the years the result is
that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum
protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if
timestamps are in action, or is equal to t_maxopd otherwise. That's a very
rough estimate of MSS reduced by options length. Throughout the code it
was used in places, where preciseness was not important, like cwnd or
ssthresh calculations.

With this change:

- t_maxopd goes away.
- t_maxseg now stores MSS not adjusted by options.
- new function tcp_maxseg() is provided, that calculates MSS reduced by
options length. The functions gives a better estimate, since it takes
into account SACK state as well.

Reviewed by: jtl
Differential Revision: https://reviews.freebsd.org/D3593


# 281a0fd4 24-Dec-2015 Patrick Kelsey <pkelsey@FreeBSD.org>

Implementation of server-side TCP Fast Open (TFO) [RFC7413].

TFO is disabled by default in the kernel build. See the top comment
in sys/netinet/tcp_fastopen.c for implementation particulars.

Reviewed by: gnn, jch, stas
MFC after: 3 days
Sponsored by: Verisign, Inc.
Differential Revision: https://reviews.freebsd.org/D4350


# 55bceb1e 15-Dec-2015 Randall Stewart <rrs@FreeBSD.org>

First cut of the modularization of our TCP stack. Still
to do is to clean up the timer handling using the async-drain.
Other optimizations may be coming to go with this. Whats here
will allow differnet tcp implementations (one included).
Reviewed by: jtl, hiren, transports
Sponsored by: Netflix Inc.
Differential Revision: D4055


# 4d163382 10-Dec-2015 Hiren Panchasara <hiren@FreeBSD.org>

Clean up unused bandwidth entry in the TCP hostcache.

Submitted by: Jason Wolfe (j at nitrology dot com)
Reviewed by: rrs, hiren
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D4154


# b87170f2 09-Dec-2015 Hiren Panchasara <hiren@FreeBSD.org>

r290122 added 4 bytes and removed 8 in struct sackhint. Add a pad entry of 4
bytes to restore the size.

Spotted by: rrs
Reviewed by: rrs
X-MFC with: r290122
Sponsored by: Limelight Networks


# 021eaf79 08-Dec-2015 Hiren Panchasara <hiren@FreeBSD.org>

One of the ways to detect loss is to count duplicate acks coming back from the
other end till it reaches predetermined threshold which is 3 for us right now.
Once that happens, we trigger fast-retransmit to do loss recovery.

Main problem with the current implementation is that we don't honor SACK
information well to detect whether an incoming ack is a dupack or not. RFC6675
has latest recommendations for that. According to it, dupack is a segment that
arrives carrying a SACK block that identifies previously unknown information
between snd_una and snd_max even if it carries new data, changes the advertised
window, or moves the cumulative acknowledgment point.

With the prevalence of Selective ACK (SACK) these days, improper handling can
lead to delayed loss recovery.

With the fix, new behavior looks like following:

0) th_ack < snd_una --> ignore
Old acks are ignored.
1) th_ack == snd_una, !sack_changed --> ignore
Acks with SACK enabled but without any new SACK info in them are ignored.
2) th_ack == snd_una, window == old_window --> increment
Increment on a good dupack.
3) th_ack == snd_una, window != old_window, sack_changed --> increment
When SACK enabled, it's okay to have advertized window changed if the ack has
new SACK info.
4) th_ack > snd_una --> reset to 0
Reset to 0 when left edge moves.
5) th_ack > snd_una, sack_changed --> increment
Increment if left edge moves but there is new SACK info.

Here, sack_changed is the indicator that incoming ack has previously unknown
SACK info in it.

Note: This fix is not fully compliant to RFC6675. That may require a few
changes to current implementation in order to keep per-sackhole dupack counter
and change to the way we mark/handle sack holes.

PR: 203663
Reviewed by: jtl
MFC after: 3 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D4225


# 12eeb81f 28-Oct-2015 Hiren Panchasara <hiren@FreeBSD.org>

Calculate the correct amount of bytes that are in-flight for a connection as
suggested by RFC 6675.

Currently differnt places in the stack tries to guess this in suboptimal ways.
The main problem is that current calculations don't take sacked bytes into
account. Sacked bytes are the bytes receiver acked via SACK option. This is
suboptimal because it assumes that network has more outstanding (unacked) bytes
than the actual value and thus sends less data by setting congestion window
lower than what's possible which in turn may cause slower recovery from losses.

As an example, one of the current calculations looks something like this:
snd_nxt - snd_fack + sackhint.sack_bytes_rexmit
New proposal from RFC 6675 is:
snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit
which takes sacked bytes into account which is a new addition to the sackhint
struct. Only thing we are missing from RFC 6675 is isLost() i.e. segment being
considered lost and thus adjusting pipe based on that which makes this
calculation a bit on conservative side.

The approach is very simple. We already process each ack with sack info in
tcp_sack_doack() and extract sack blocks/holes out of it. We'd now also track
this new variable sacked_bytes which keeps track of total sacked bytes reported.

One downside to this approach is that we may get incorrect count of sacked_bytes
if the other end decides to drop sack info in the ack because of memory pressure
or some other reasons. But in this (not very likely) case also the pipe
calculation would be conservative which is okay as opposed to being aggressive
in sending packets into the network.

Next step is to use this more accurate pipe estimation to drive congestion
window adjustments.

In collaboration with: rrs
Reviewed by: jason_eggnet dot com, rrs
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D3971


# 356c7958 27-Oct-2015 Hiren Panchasara <hiren@FreeBSD.org>

Add sysctl tunable net.inet.tcp.initcwnd_segments to specify initial congestion
window in number of segments on fly. It is set to 10 segments by default.

Remove net.inet.tcp.experimental.initcwnd10 which is now redundant. Also remove
the parent node net.inet.tcp.experimental as it's not needed anymore and also
because it was not well thought out.

Differential Revision: https://reviews.freebsd.org/D3858
In collaboration with: lstewart
Reviewed by: gnn (prev version), rwatson, allanjude, wblock (man page)
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Limelight Networks


# 86a996e6 13-Oct-2015 Hiren Panchasara <hiren@FreeBSD.org>

There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.

It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.

To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).

There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.

I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.

The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.

Differential Revision: D3100
Submitted by: Jonathan Looney <jlooney at juniper dot net>
Reviewed by: gnn, hiren


# 1558cb24 26-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Eliminate nd6_nud_hint() and its TCP bindings.

Initially function was introduced in r53541 (KAME initial commit) to
"provide hints from upper layer protocols that indicate a connection
is making "forward progress"" (quote from RFC 2461 7.3.1 Reachability
Confirmation).
However, it was converted to do nothing (e.g. just return) in r122922
(tcp_hostcache implementation) back in 2003. Some defines were moved
to tcp_var.h in r169541. Then, it was broken (for non-corner cases)
by r186119 (L2<>L3 split) in 2008 (NULL ifp in nd6_lookup). So,
right now this code is broken and has no "real" base users.

Differential Revision: https://reviews.freebsd.org/D3699


# 24067db8 03-Sep-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Make tcp_mtudisc() static and void. No functional changes.

Sponsored by: Nginx, Inc.


# 03041aaa 30-Jul-2015 Hiren Panchasara <hiren@FreeBSD.org>

Update snd_una description to make it more readable.

Differential Revision: https://reviews.freebsd.org/D3179
Reviewed by: gnn
Sponsored by: Limelight Networks


# 4741bfcb 29-Jul-2015 Patrick Kelsey <pkelsey@FreeBSD.org>

Revert r265338, r271089 and r271123 as those changes do not handle
non-inline urgent data and introduce an mbuf exhaustion attack vector
similar to FreeBSD-SA-15:15.tcp, but not requiring VNETs.

Address the issue described in FreeBSD-SA-15:15.tcp.

Reviewed by: glebius
Approved by: so
Approved by: jmallett (mentor)
Security: FreeBSD-SA-15:15.tcp
Sponsored by: Norse Corp, Inc.


# 5571f9cf 16-Apr-2015 Julien Charbon <jch@FreeBSD.org>

Fix an old and well-documented use-after-free race condition in
TCP timers:
- Add a reference from tcpcb to its inpcb
- Defer tcpcb deletion until TCP timers have finished

Differential Revision: https://reviews.freebsd.org/D2079
Submitted by: jch, Marc De La Gueronniere <mdelagueronniere@verisign.com>
Reviewed by: imp, rrs, adrian, jhb, bz
Approved by: jhb
Sponsored by: Verisign, Inc.


# 71da7153 17-Nov-2014 Julien Charbon <jch@FreeBSD.org>

Re-introduce padding fields removed with r264321 to keep
struct tcptw ABI unchanged.

Suggested by: jhb
Approved by: jhb (mentor)
MFC after: 1 day
X-MFC-With: r264321


# 4952ad42 03-Nov-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Restore spares used in "struct tcpcb" and bump "__FreeBSD_version" to
indicate need for kernel module re-compilation.

Sponsored by: Mellanox Technologies


# cea40c48 30-Oct-2014 Julien Charbon <jch@FreeBSD.org>

Fix a race condition in TCP timewait between tcp_tw_2msl_reuse() and
tcp_tw_2msl_scan(). This race condition drives unplanned timewait
timeout cancellation. Also simplify implementation by holding inpcb
reference and removing tcptw reference counting.

Differential Revision: https://reviews.freebsd.org/D826
Submitted by: Marc De la Gueronniere <mdelagueronniere@verisign.com>
Submitted by: jch
Reviewed By: jhb (mentor), adrian, rwatson
Sponsored by: Verisign, Inc.
MFC after: 2 weeks
X-MFC-With: r264321


# f6f6703f 07-Oct-2014 Sean Bruno <sbruno@FreeBSD.org>

Implement PLPMTUD blackhole detection (RFC 4821), inspired by code
from xnu sources. If we encounter a network where ICMP is blocked
the Needs Frag indicator may not propagate back to us. Attempt to
downshift the mss once to a preconfigured value.

Default this feature to off for now while we do not have a full PLPMTUD
implementation in our stack.

Adds the following new sysctl's for control:
net.inet.tcp.pmtud_blackhole_detection -- turns on/off this feature
net.inet.tcp.pmtud_blackhole_mss -- mss to try for ipv4
net.inet.tcp.v6pmtud_blackhole_mss -- mss to try for ipv6

Adds the following new sysctl's for monitoring:
-- Number of times the code was activated to attempt a mss downshift
net.inet.tcp.pmtud_blackhole_activated
-- Number of times the blackhole mss was used in an attempt to downshift
net.inet.tcp.pmtud_blackhole_min_activated
-- Number of times that we failed to connect after we downshifted the mss
net.inet.tcp.pmtud_blackhole_failed

Phabricator: https://reviews.freebsd.org/D506
Reviewed by: rpaulo bz
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Limelight Networks


# 29c47f18 27-Sep-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Split tcp_signature_compute() into 2 pieces:
- tcp_get_sav() - SADB key lookup
- tcp_signature_do_compute() - actual computation
* Fix TCP signature case for listening socket:
do not assume EVERY connection coming to socket
with TCP_SIGNATURE set to be md5 signed regardless
of SADB key existance for particular address. This
fixes the case for routing software having _some_
BGP sessions secured by md5.
* Simplify TCP_SIGNATURE handling in tcp_input()

MFC after: 2 weeks


# 9fd573c3 22-Sep-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Improve transmit sending offload, TSO, algorithm in general.

The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

Reviewed by: adrian, rmacklem
Sponsored by: Mellanox Technologies
MFC after: 1 week


# 8f5a8818 07-Aug-2014 Kevin Lo <kevlo@FreeBSD.org>

Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have
only one protocol switch structure that is shared between ipv4 and ipv6.

Phabric: D476
Reviewed by: jhb


# 4fd2b4eb 24-May-2014 Bjoern A. Zeeb <bz@FreeBSD.org>

Make tcp_twrespond() file local private; this removes it from the
public KPI; it is not used anywhere else and seems it never was.

MFC after: 2 weeks


# 255cd9fd 23-May-2014 Bjoern A. Zeeb <bz@FreeBSD.org>

Move the tcp_fields_to_host() and tcp_fields_to_net() (inline)
functions to the tcp_var.h header file in order to avoid further
duplication with upcoming commits.

Reviewed by: np
MFC after: 2 weeks


# b1a41566 16-May-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Provide compatibility #define after r265408.

Suggested by: truckman


# f1395664 15-May-2014 Mike Silbersack <silby@FreeBSD.org>

Remove the function tcp_twrecycleable; it has been #if 0'd for
eight years. The original concept was to improve the
corner case where you run out of ephemeral ports, but it
was causing performance problems and the mechanism
of limiting the number of time_wait sockets serves
the same purpose in the end.

Reviewed by: bz


# c669105d 05-May-2014 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove net.inet.tcp.reass.overflows sysctl. It counts exactly
same events that tcpstat's tcps_rcvmemdrop counter counts.
- Rename tcps_rcvmemdrop to tcps_rcvreassfull and improve its
description in netstat(1) output.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# e407b67b 04-May-2014 Gleb Smirnoff <glebius@FreeBSD.org>

The FreeBSD-SA-14:08.tcp was a lesson on not doing acrobatics with
mixing on stack memory and UMA memory in one linked list.

Thus, rewrite TCP reassembly code in terms of memory usage. The
algorithm remains unchanged.

We actually do not need extra memory to build a reassembly queue.
Arriving mbufs are always packet header mbufs. So we got the length
of data as pkthdr.len. We got m_nextpkt for linkage. And we need
only one pointer to point at the tcphdr, use PH_loc for that.

In tcpcb the t_segq fields becomes mbuf pointer. The t_segqlen
field now counts not packets, but bytes in the queue. This gives
us more precision when comparing to socket buffer limits.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 66eefb1e 10-Apr-2014 John Baldwin <jhb@FreeBSD.org>

Currently, the TCP slow timer can starve TCP input processing while it
walks the list of connections in TIME_WAIT closing expired connections
due to contention on the global TCP pcbinfo lock.

To remediate, introduce a new global lock to protect the list of
connections in TIME_WAIT. Only acquire the TCP pcbinfo lock when
closing an expired connection. This limits the window of time when
TCP input processing is stopped to the amount of time needed to close
a single connection.

Submitted by: Julien Charbon <jcharbon@verisign.com>
Reviewed by: rwatson, rrs, adrian
MFC after: 2 months


# 5b26ea5d 25-Feb-2014 John Baldwin <jhb@FreeBSD.org>

Remove more constants related to static sysctl nodes. The MAXID constants
were primarily used to size the sysctl name list macros that were removed
in r254295. A few other constants either did not have an associated
sysctl node, or the associated node used OID_AUTO instead.

PR: ports/184525 (exp-run)


# a5f44cd7 21-Sep-2013 Bjoern A. Zeeb <bz@FreeBSD.org>

Introduce spares in the TCP syncache and timewait structures
so that fixed TCP_SIGNATURE handling can later be merged.

This is derived from follow-up work to SVN r183001 posted to
net@ on Sep 13 2008.

Approved by: re (gjb)


# fd77bbb9 26-Aug-2013 John Baldwin <jhb@FreeBSD.org>

Remove most of the remaining sysctl name list macros. They were only
ever intended for use in sysctl(8) and it has not used them for many
years.

Reviewed by: bde
Tested by: exp-run by bdrewery


# 57f60867 25-Aug-2013 Mark Johnston <markj@FreeBSD.org>

Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by: gnn, hiren
MFC after: 1 month


# 5da0521f 09-Jul-2013 Andrey V. Elsukov <ae@FreeBSD.org>

Use new macros to implement ipstat and tcpstat using PCPU counters.
Change interface of kread_counters() similar ot kread() in the netstat(1).


# 3c914c54 02-Jun-2013 Andre Oppermann <andre@FreeBSD.org>

Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore. The upper limit is still at
IP_MAXPACKET (65536 bytes). Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.

The placement into "struct ifnet" is a bit hackish but the best place
that was found. When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by: cperciva (earlier version)
Reviewed by: cperciva
Tested by: cperciva
MFC after: 1 week (using spare struct members to preserve ABI)


# 5923c293 08-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/counters: TCP/IP stats.

Convert 'struct ipstat' and 'struct tcpstat' to counter(9).

This speeds up IP forwarding at extreme packet rates, and
makes accounting more precise.

Sponsored by: Nginx, Inc.


# 09440655 28-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Increase the initial CWND to 10 segments as defined in IETF TCPM
draft-ietf-tcpm-initcwnd-05. It explains why the increased initial
window improves the overall performance of many web services without
risking congestion collapse.

As long as it remains a draft it is placed under a sysctl marking it
as experimental:
net.inet.tcp.experimental.initcwnd10 = 1
When it becomes an official RFC soon the sysctl will be changed to
the RFC number and moved to net.inet.tcp.

This implementation differs from the RFC draft in that it is a bit
more conservative in the case of packet loss on SYN or SYN|ACK because
we haven't reduced the default RTO to 1 second yet. Also the restart
window isn't yet increased as allowed. Both will be adjusted with
upcoming changes.

Is is enabled by default. In Linux it is enabled since kernel 3.0.

MFC after: 2 weeks


# 09fe6320 19-Jun-2012 Navdeep Parhar <np@FreeBSD.org>

- Updated TOE support in the kernel.

- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)


# ef341ee1 16-Apr-2012 Gleb Smirnoff <glebius@FreeBSD.org>

When we receive an ICMP unreach need fragmentation datagram, we take
proposed MTU value from it and update the TCP host cache. Then
tcp_mss_update() is called on the corresponding tcpcb. It finds the
just allocated entry in the TCP host cache and updates MSS on the
tcpcb. And then we do a fast retransmit of what we have in the tcp
send buffer.

This sequence gets broken if the TCP host cache is exausted. In this
case allocation fails, and later called tcp_mss_update() finds nothing
in cache. The fast retransmit is done with not reduced MSS and is
immidiately replied by remote host with new ICMP datagrams and the
cycle repeats. This ping-pong can go up to wirespeed.

To fix this:
- tcp_mss_update() gets new parameter - mtuoffer, that is like
offer, but needs to have min_protoh subtracted.
- tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify().
- tcp_mtudisc() now accepts not a useless error argument, but proposed
MTU value, that is passed to tcp_mss_update() as mtuoffer.

Reported by: az
Reported by: Andrey Zonov <andrey zonov.org>
Reviewed by: andre (previous version of patch)


# 9077f387 05-Feb-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and
TCP_KEEPCNT, that allow to control initial timeout, idle time, idle
re-send interval and idle send count on a per-socket basis.

Reviewed by: andre, bz, lstewart


# 9ec4a4cc 16-Oct-2011 Andre Oppermann <andre@FreeBSD.org>

Remove the ss_fltsz and ss_fltsz_local sysctl's which have
long been superseded by the RFC3390 initial CWND sizing.

Also remove the remnants of TCP_METRICS_CWND which used the
TCP hostcache to set the initial CWND in a non-RFC compliant
way.

MFC after: 1 week


# e233e2ac 16-Oct-2011 Andre Oppermann <andre@FreeBSD.org>

VNET virtualize tcp_sendspace/tcp_recvspace and change the
type to INT. A long is not necessary as the TCP window is
limited to 2**30. A larger initial window isn't useful.

MFC after: 1 week


# d9a36286 17-Jul-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Add spares to the network stack for FreeBSD-9:
- TCP keep* timers
- TCP UTO (adjust from what was there already)
- netmap
- route caching
- user cookie (temporary to allow for the real fix)

Slightly re-shuffle struct ifnet moving fields out of the middle
of spares and to better align.

Discussed with: rwatson (slightly earlier version)


# 672dc4ae 29-Apr-2011 John Baldwin <jhb@FreeBSD.org>

TCP reuses t_rxtshift to determine the backoff timer used for both the
persist state and the retransmit timer. However, the code that implements
"bad retransmit recovery" only checks t_rxtshift to see if an ACK has been
received in during the first retransmit timeout window. As a result, if
ticks has wrapped over to a negative value and a socket is in the persist
state, it can incorrectly treat an ACK from the remote peer as a
"bad retransmit recovery" and restore saved values such as snd_ssthresh and
snd_cwnd. However, if the socket has never had a retransmit timeout, then
these saved values will be zero, so snd_ssthresh and snd_cwnd will be set
to 0.

If the socket is in fast recovery (this can be caused by excessive
duplicate ACKs such as those fixed by 220794), then each ACK that arrives
triggers either NewReno or SACK partial ACK handling which clamps snd_cwnd
to be no larger than snd_ssthresh. In effect, the socket's send window
is permamently stuck at 0 even though the remote peer is advertising a
much larger window and pending data is only sent via TCP window probes
(so one byte every few seconds).

Fix this by adding a new TCP pcb flag (TF_PREVVALID) that indicates that
the various snd_*_prev fields in the pcb are valid and only perform
"bad retransmit recovery" if this flag is set in the pcb. The flag is set
on the first retransmit timeout that occurs and is cleared on subsequent
retransmit timeouts or when entering the persist state.

Reviewed by: bz
MFC after: 2 weeks


# 2903309a 25-Apr-2011 Attilio Rao <attilio@FreeBSD.org>

Add the possibility to verify MD5 hash of incoming TCP packets.
As long as this is a costy function, even when compiled in (along with
the option TCP_SIGNATURE), it can be disabled via the
net.inet.tcp.signature_verify_input sysctl.

Sponsored by: Sandvine Incorporated
Reviewed by: emaste, bz
MFC after: 2 weeks


# f1f5cc47 10-Jan-2011 Lawrence Stewart <lstewart@FreeBSD.org>

Fixe some whitespace nits that were introduced in r216758.

Sponsored by: FreeBSD Foundation
Submitted by: pjd
MFC after: 10 weeks
X-MFC with: r216758


# 79e955ed 07-Jan-2011 John Baldwin <jhb@FreeBSD.org>

Trim extra spaces before tabs.


# e29f3cc7 27-Dec-2010 Lawrence Stewart <lstewart@FreeBSD.org>

Add a comment for the ccv member of struct tcpcb.

Sponsored by: FreeBSD Foundation
MFC after: 5 weeks
X-MFC with: r215166


# 39bc9de5 27-Dec-2010 Lawrence Stewart <lstewart@FreeBSD.org>

- Add some helper hook points to the TCP stack. The hooks allow Khelp modules to
access inbound/outbound events and associated data for established TCP
connections. The hooks only run if at least one hook function is registered
for the hook point, ensuring the impact on the stack is effectively nil when
no TCP Khelp modules are loaded. struct tcp_hhook_data is passed as contextual
data to any registered Khelp module hook functions.

- Add an OSD (Object Specific Data) pointer to struct tcpcb to allow Khelp
modules to associate per-connection data with the TCP control block.

- Bump __FreeBSD_version and add a note to UPDATING regarding to ABI changes
introduced by this commit and r216753.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz, others along the way
MFC after: 3 months


# bee9ab2b 27-Dec-2010 Lawrence Stewart <lstewart@FreeBSD.org>

Add a new sack hint to track the most recent and highest sacked sequence number.
This will be used by the incoming Enhanced RTT Khelp module.

Sponsored by: FreeBSD Foundation
Submitted by: David Hayes <dahayes at swin edu au>
Reviewed by: bz and others (as part of a larger patch)
MFC after: 3 months


# f5d34df5 17-Nov-2010 George V. Neville-Neil <gnn@FreeBSD.org>

Add new, per connection, statistics for TCP, including:
Retransmitted Packets
Zero Window Advertisements
Out of Order Receives

These statistics are available via the -T argument to
netstat(1).
MFC after: 2 weeks


# 99065ae6 16-Nov-2010 Lawrence Stewart <lstewart@FreeBSD.org>

Move protocol specific implementation detail out of the core CC framework.

Sponsored by: FreeBSD Foundation
Tested by: Mikolaj Golub <to.my.trociny at gmail com>
MFC after: 11 weeks
X-MFC with: r215166


# dbc42409 11-Nov-2010 Lawrence Stewart <lstewart@FreeBSD.org>

This commit marks the first formal contribution of the "Five New TCP Congestion
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/

- Add a KPI and supporting infrastructure to allow modular congestion control
algorithms to be used in the net stack. Algorithms can maintain per-connection
state if required, and connections maintain their own algorithm pointer, which
allows different connections to concurrently use different algorithms. The
TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
programmatically query or change the congestion control algorithm respectively
from within an application at runtime.

- Integrate the framework with the TCP stack in as least intrusive a manner as
possible. Care was also taken to develop the framework in a way that should
allow integration with other congestion aware transport protocols (e.g. SCTP)
in the future. The hope is that we will one day be able to share a single set
of congestion control algorithm modules between all congestion aware transport
protocols.

- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
and use it to decouple the meaning of recovery from a congestion event and
recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
congestion control protocols don't generally need to recover from packet loss
and need a different way to note a congestion recovery episode within the
stack.

- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
and ensures the stack always uses the appropriate mechanisms for recovering
from packet loss during a congestion recovery episode.

- Extract the NewReno congestion control algorithm from the TCP stack and
massage it into module form. NewReno is always built into the kernel and will
remain the default algorithm for the forseeable future. Implementations of
additional different algorithms will become available in the near future.

- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
that relies on the size of "struct tcpcb" is required.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: Cisco URP, FreeBSD Foundation
Reviewed by: rpaulo
Tested by: David Hayes (and many others over the years)
MFC after: 3 months


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 0c236c4e 24-Sep-2010 Lawrence Stewart <lstewart@FreeBSD.org>

Internalise reassembly queue related functionality and variables which should
not be used outside of the reassembly queue implementation. Provide a new
function to flush all segments from a reassembly queue and call it from the
appropriate places instead of manipulating the queue directly.

Sponsored by: FreeBSD Foundation
Reviewed by: andre, gnn, rpaulo
MFC after: 2 weeks


# 1c18314d 16-Sep-2010 Andre Oppermann <andre@FreeBSD.org>

Remove the TCP inflight bandwidth limiter as announced in r211315
to give way for the pluggable congestion control framework. It is
the task of the congestion control algorithm to set the congestion
window and amount of inflight data without external interference.

In 'struct tcpcb' the variables previously used by the inflight
limiter are renamed to spares to keep the ABI intact and to have
some more space for future extensions.

In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to
preserve the ABI. It is always set to 0.

In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed
to preserve the ABI. It is always set to 0.

These unused variable in the various structures may be reused in the
future or garbage collected before the next release or at some other
point when an ABI change happens anyway for other reasons.

No MFC is planned. The inflight bandwidth limiter stays disabled by
default in the other branches but remains available.


# c3f0bdc6 18-Aug-2010 Andre Oppermann <andre@FreeBSD.org>

If a TCP connection has been idle for one retransmit timeout or more
it must reset its congestion window back to the initial window.

RFC3390 has increased the initial window from 1 segment to up to
4 segments.

The initial window increase of RFC3390 wasn't reflected into the
restart window which remained at its original defaults of 4 segments
for local and 1 segment for all other connections. Both values are
controllable through sysctl net.inet.tcp.local_slowstart_flightsize
and net.inet.tcp.slowstart_flightsize.

The increase helps TCP's slow start algorithm to open up the congestion
window much faster.

Reviewed by: lstewart
MFC after: 1 week


# b7d747ec 18-Aug-2010 Andre Oppermann <andre@FreeBSD.org>

Untangle the net.inet.tcp.log_in_vain and net.inet.tcp.log_debug
sysctl's and remove any side effects.

Both sysctl's share the same backend infrastructure and due to the
way it was implemented enabling net.inet.tcp.log_in_vain would also
cause log_debug output to be generated. This was surprising and
eventually annoying to the user.

The log output backend is kept the same but a little shim is inserted
to properly separate log_in_vain and log_debug and to remove any side
effects.

PR: kern/137317
MFC after: 1 week


# 480d7c6c 06-May-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r207369:
MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH


# 82cea7e6 29-Apr-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days


# 62f500d0 27-Mar-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r204838:

Destroy TCP UMA zones (empty or not) upon network stack teardown
to not leak them, otherwise making UMA/vmstat unhappy with every
stoped vnet.
We will still leak pages (especially for zones marked NOFREE).

Reshuffle cleanup order in tcp_destroy() to get rid of what we can
easily free first.

Reviewed by: rwatson


# 376aadf8 07-Mar-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

Destroy TCP UMA zones (empty or not) upon network stack teardown
to not leak them, otherwise making UMA/vmstat unhappy with every stoped vnet.
We will still leak pages (especially for zones marked NOFREE).

Reshuffle cleanup order in tcp_destroy() to get rid of what we can
easily free first.

Sponsored by: ISPsystem
Reviewed by: rwatson
MFC after: 5 days


# be6797dd 23-Jan-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r202469:
Garbage collect references to the no longer implemented tcp_fasttimo().


# 4dcc55a3 17-Jan-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

Garbage collect references to the no longer implemented tcp_fasttimo().

Discussed with: rwatson
MFC after: 5 days


# b8614722 15-Sep-2009 Mike Silbersack <silby@FreeBSD.org>

Add the ability to see TCP timers via netstat -x. This can be a useful
feature when you have a seemingly stuck socket and want to figure
out why it has not been closed yet.

No plans to MFC this, as it changes the netstat sysctl ABI.

Reviewed by: andre, rwatson, Eric Van Gyzen


# 315e3e38 02-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Many network stack subsystems use a single global data structure to hold
all pertinent statatistics for the subsystem. These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.

Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored. This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.

The following modules are affected by this change:

if_bridge
if_cxgb
if_gif
ip_mroute
ipdivert
pf

In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack. However, that change is too agressive for
this point in the release cycle.

Reviewed by: bz
Approved by: re (kib)


# 1e77c105 16-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used. Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with: bz, julian
Reviewed by: bz
Approved by: re (kensmith, kib)


# eddfbb76 14-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 237fbe0a 13-Jul-2009 Lawrence Stewart <lstewart@FreeBSD.org>

Replace struct tcpopt with a proxy toeopt struct in the TOE driver interface to
the TCP syncache. This returns struct tcpopt to being private within the TCP
implementation, thus allowing it to be modified without ABI concerns.

The patch breaks the ABI. Bump __FreeBSD_version to 800103 accordingly. The cxgb
driver is the only TOE consumer affected by this change, and needs to be
recompiled along with the kernel.

Suggested by: rwatson
Reviewed by: rwatson, kmacy
Approved by: re (kensmith), kensmith (mentor temporarily unavailable)


# 962ebef8 12-Jul-2009 Lawrence Stewart <lstewart@FreeBSD.org>

Pad the following TCP related structs to allow MFCs of upcoming features/fixes
back to the 8 branch:

tcp_var.h
- struct sackhint
- struct tcpcb
- struct tcpstat

The patch breaks the ABI. Bump __FreeBSD_version to 800102 accordingly. User
space tools that rely on the size of any of these structs (e.g. sockstat) need
to be recompiled.

Reviewed by: rpaulo, sam, andre, rwatson
Approved by: re & mentor (gnn)


# 9f78a87a 16-Jun-2009 John Baldwin <jhb@FreeBSD.org>

- Change members of tcpcb that cache values of ticks from int to u_int:
t_rcvtime, t_starttime, t_rtttime, t_bw_rtttime, ts_recent_age,
t_badrxtwin.
- Change t_recent in struct timewait from u_long to u_int32_t to match
the type of the field it shadows from tcpcb: ts_recent.
- Change t_starttime in struct timewait from u_long to u_int to match
the t_starttime field in tcpcb.

Requested by: bde (1, 3)


# 0e8cc7e7 10-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Change a few members of tcpcb that store cached copies of ticks to be ints
instead of unsigned longs. This fixes a few overflow edge cases on 64-bit
platforms. Specifically, if an idle connection receives a packet shortly
before 2^31 clock ticks of uptime (about 25 days with hz=1000) and the keep
alive timer fires after 2^31 clock ticks, the keep alive timer will think
that the connection has been idle for a very long time and will immediately
drop the connection instead of sending a keep alive probe.

Reviewed by: silby, gnn, lstewart
MFC after: 1 week


# bc29160d 08-Jun-2009 Marko Zec <zec@FreeBSD.org>

Introduce an infrastructure for dismantling vnet instances.

Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)


# f6dfe47a 30-Apr-2009 Marko Zec <zec@FreeBSD.org>

Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:

options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

INIT_VNET_NET(ifp->if_vnet); becomes

struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by: bz, rwatson
Approved by: julian (mentor)


# de231a06 12-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Put TCPSTAT_ADD() and TCPSTAT_INC() behind _KERNEL.

MFC after: 3 days


# 78b50714 11-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel. This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after: 3 days


# de4fbddd 22-Jan-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Add externs to fix build with VIMAGE_GLOBALS after r187289.


# 24cb0f22 14-Jan-2009 Lawrence Stewart <lstewart@FreeBSD.org>

Add TCP Appropriate Byte Counting (RFC 3465) support to kernel.

The new behaviour is on by default, and can be disabled by setting the
net.inet.tcp.rfc3465 sysctl to 0 to obtain previous behaviour.

The patch changes struct tcpcb in sys/netinet/tcp_var.h which breaks
the ABI. Bump __FreeBSD_version to 800061 accordingly. User space tools
that rely on the size of struct tcpcb (e.g. sockstat) need to be recompiled.

Reviewed by: rpaulo, gnn
Approved by: gnn, kmacy (mentors)
Sponsored by: FreeBSD Foundation


# 1b193af6 13-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Second round of putting global variables, which were virtualized
but formerly missed under VIMAGE_GLOBAL.

Put the extern declarations of the virtualized globals
under VIMAGE_GLOBAL as the globals themsevles are already.
This will help by the time when we are going to remove the globals
entirely.

Sponsored by: The FreeBSD Foundation


# 86413abf 11-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Put a global variables, which were virtualized but formerly
missed under VIMAGE_GLOBAL.

Start putting the extern declarations of the virtualized globals
under VIMAGE_GLOBAL as the globals themsevles are already.
This will help by the time when we are going to remove the globals
entirely.

While there garbage collect a few dead externs from ip6_var.h.

Sponsored by: The FreeBSD Foundation


# c3ce7a79 10-Dec-2008 Robert Watson <rwatson@FreeBSD.org>

Move flag definitions for t_flags and t_oobflags below the definition of
struct tcpcb so that the structure definition is a bit more vertically
compact. Can't yet fit it on one printed page, though.

MFC after: pretty soon


# 44e33a07 19-Nov-2008 Marko Zec <zec@FreeBSD.org>

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 91d6cfa6 06-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

Move the TSO logic back to tcp_mss() and out of tcp_mss_update().
We tried to avoid that initially but if were are called from
tcp_output() with EMSGSIZE, we cleared the TSO flag on the tcpcb
there, called into tcp_mtudisc() and tcp_mss_update() which
then would reenable TSO on the tcpcb based on TSO capabilities
of the interface as learnt in tcp_maxmtu/6().
So if TSO was enabled on the (possibly new) outgoing interface
it was turned back on, which lead to an endless loop between
tcp_output() and tcp_mtudisc() until we overflew the stack.

Reported by: kmacy
MFC after: 2 months (along with r182851)


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 3cee92e0 07-Sep-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Split tcp_mss() in tcp_mss() and tcp_mss_update() where the former
calls the latter.

Merge tcp_mss_update() with code from tcp_mtudisc() basically
doing the same thing.

This gives us one central place where we calcuate and check mss values
to update t_maxopd (maximum mss + options length) instead of two slightly
different but almost equal implementations to maintain.

PR: kern/118455
Reviewed by: silby (back in March)
MFC after: 2 months


# f2512ba1 31-Jul-2008 Rui Paulo <rpaulo@FreeBSD.org>

MFp4 (//depot/projects/tcpecn/):

TCP ECN support. Merge of my GSoC 2006 work for NetBSD.
TCP ECN is defined in RFC 3168.

Partly reviewed by: dwmalone, silby
Obtained from: NetBSD


# 032fae41 20-Apr-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Revert to rev. 1.161 - switch back to optimized TCP options ordering.

A lot of testing has shown that the problem people were seeing was due
to invalid padding after the end of option list option, which was corrected
in tcp_output.c rev. 1.146.

Thanks to: anders@, s3raphi, Matt Reimer
Thanks to: Doug Hardie and Randy Rose, John Mayer, Susan Guzzardi
Special thanks to: dwhite@ and BitGravity
Discussed with: silby
MFC after: 1 day


# ea346b19 23-Feb-2008 Mike Silbersack <silby@FreeBSD.org>

Change FreeBSD 7 so that it returns TCP options in
the same order that FreeBSD 6 and before did. Doug
White and the other bloodhounds at ISC discovered that
while FreeBSD 7's ordering of options was more efficient,
it caused some cable modem routers to ignore the
SYN-ACKs ordered in this fashion.

The placement of sackOK after the timestamp option seems
to be the critical difference:

FreeBSD 6:
<mss 1460,nop,wscale 1,nop,nop,timestamp 3512155768 0,sackOK,eol>

FreeBSD 7.0:
<mss 1460,nop,wscale 3,sackOK,timestamp 1370692577 0>

FreeBSD 7.0 + this change:
<mss 1460,nop,wscale 3,nop,nop,timestamp 7371813 0,sackOK,eol>

MFC after: 1 week


# 76b262c4 12-Dec-2007 Kip Macy <kmacy@FreeBSD.org>

Fix style issues with initial TCP offload commit

Requested by: rwatson
Submitted by: rwatson


# 620721db 12-Dec-2007 Kip Macy <kmacy@FreeBSD.org>

Add driver independent interface to offload active established TCP connections

Reviewed by: silby


# 2de2af32 06-Dec-2007 Kip Macy <kmacy@FreeBSD.org>

Add padding for anticipated functionality
- vimage
- TOE
- multiq
- host rtentry caching

Rename spare used by 80211 to if_llsoftc

Reviewed by: rwatson, gnn
MFC after: 1 day


# e2f2059f 23-Sep-2007 Mike Silbersack <silby@FreeBSD.org>

Two changes:

- Reintegrate the ANSI C function declaration change
from tcp_timer.c rev 1.92

- Reorganize the tcpcb structure so that it has a single
pointer to the "tcp_timer" structure which contains all
of the tcp timer callouts. This change means that when
the single tcp timer change is reintegrated, tcpcb will
not change in size, and therefore the ABI between
netstat and the kernel will not change.

Neither of these changes should have any functional
impact.

Reviewed by: bmah, rrs
Approved by: re (bmah)


# 85d94372 07-Sep-2007 Robert Watson <rwatson@FreeBSD.org>

Back out tcp_timer.c:1.93 and associated changes that reimplemented the many
TCP timers as a single timer, but retain the API changes necessary to
reintroduce this change. This will back out the source of at least two
reported problems: lock leaks in certain timer edge cases, and TCP timers
continuing to fire after a connection has closed (a bug previously fixed and
then reintroduced with the timer rewrite).

In a follow-up commit, some minor restylings and comment changes performed
after the TCP timer rewrite will be reapplied, and a further change to allow
the TCP timer rewrite to be added back without disturbing the ABI. The new
design is believed to be a good thing, but the outstanding issues are
leading to significant stability/correctness problems that are holding
up 7.0.

This patch was generated by silby, but is being committed by proxy due to
poor network connectivity for silby this week.

Approved by: re (kensmith)
Submitted by: silby
Tested by: rwatson, kris
Problems reported by: peter, kris, others


# 773673c1 27-Jul-2007 Andre Oppermann <andre@FreeBSD.org>

Provide a sysctl to toggle reporting of TCP debug logging:

sys.net.inet.tcp.log_debug = 1

It defaults to enabled for the moment and is to be turned off for
the next release like other diagnostics from development branches.

It is important to note that sysctl sys.net.inet.tcp.log_in_vain
uses the same logging function as log_debug. Enabling of the former
also causes the latter to engage, but not vice versa.

Use consistent terminology in tcp log messages:

"ignored" means a segment contains invalid flags/information and
is dropped without changing state or issuing a reply.

"rejected" means a segments contains invalid flags/information but
is causing a reply (usually RST) and may cause a state change.

Approved by: re (rwatson)


# c325962b 26-Jul-2007 Mike Silbersack <silby@FreeBSD.org>

Export the contents of the syncache to netstat.

Approved by: re (kensmith)
MFC after: 2 weeks


# 9fb5d4c0 04-Jul-2007 Peter Wemm <peter@FreeBSD.org>

Fix cast-qualifiers warning when INET6 is not present

Approved by: re (rwatson)


# a160e630 28-May-2007 Andre Oppermann <andre@FreeBSD.org>

Refactor and rewrite in parts the SYN handling code on listen sockets
in tcp_input():

o tighten the checks on allowed TCP flags to be RFC793 and
tcp-secure conform
o log check failures to syslog at LOG_DEBUG level
o rearrange the code flow to be easier to follow
o add KASSERTs to validate assumptions of the code flow

Add sysctl net.inet.tcp.syncache.rst_on_sock_fail defaulting to enable
that controls the behavior on socket creation failure for a otherwise
successful 3-way handshake. The socket creation can fail due to global
memory shortage, listen queue limits and file descriptor limits. The
sysctl allows to chose between two options to deal with this. One is
to send a reset to the other endpoint to notify it about the failure
(default). The other one is to ignore and treat the failure as a
transient error and have the other endpoint retransmit for another try.

Reviewed by: rwatson (in general)


# df541e5f 18-May-2007 Andre Oppermann <andre@FreeBSD.org>

Add tcp_log_addrs() function to generate and standardized TCP log line
for use thoughout the tcp subsystem.

It is IPv4 and IPv6 aware creates a line in the following format:

"TCP: [1.2.3.4]:50332 to [1.2.3.4]:80 tcpflags <RST>"

A "\n" is not included at the end. The caller is supposed to add
further information after the standard tcp log header.

The function returns a NUL terminated string which the caller has
to free(s, M_TCPLOG) after use. All memory allocation is done
with M_NOWAIT and the return value may be NULL in memory shortage
situations.

Either struct in_conninfo || (struct tcphdr && (struct ip || struct
ip6_hdr) have to be supplied.

Due to ip[6].h header inclusion limitations and ordering issues the
struct ip and struct ip6_hdr parameters have to be casted and passed
as void * pointers.

tcp_log_addrs(struct in_conninfo *inc, struct tcphdr *th, void *ip4hdr,
void *ip6hdr)

Usage example:

struct ip *ip;
char *tcplog;

if (tcplog = tcp_log_addrs(NULL, th, (void *)ip, NULL)) {
log(LOG_DEBUG, "%s; %s: Connection attempt to closed port\n",
tcplog, __func__);
free(s, M_TCPLOG);
}


# 2104448f 16-May-2007 Andre Oppermann <andre@FreeBSD.org>

Move TIME_WAIT related functions and timer handling from files
other than repo copied tcp_subr.c into tcp_timewait.c#1.284:

tcp_input.c#1.350 tcp_timewait() -> tcp_twcheck()

tcp_timer.c#1.92 tcp_timer_2msl_reset() -> tcp_tw_2msl_reset()
tcp_timer.c#1.92 tcp_timer_2msl_stop() -> tcp_tw_2msl_stop()
tcp_timer.c#1.92 tcp_timer_2msl_tw() -> tcp_tw_2msl_scan()

This is a mechanical move with appropriate renames and making
them static if used only locally.

The tcp_tw_2msl_scan() cleanup function is still run from the
tcp_slowtimo() in tcp_timer.c.


# ec9c7553 13-May-2007 Andre Oppermann <andre@FreeBSD.org>

Complete the (mechanical) move of the TCP reassembly and timewait
functions from their origininal place to their own files.

TCP Reassembly from tcp_input.c -> tcp_reass.c
TCP Timewait from tcp_subr.c -> tcp_timewait.c


# 504abdc6 11-May-2007 Andre Oppermann <andre@FreeBSD.org>

Add the timestamp offset to struct tcptw so we can generate proper
ACKs in TIME_WAIT state that don't get dropped by the PAWS check
on the receiver.


# 1a553740 06-May-2007 Andre Oppermann <andre@FreeBSD.org>

Remove unused requested_s_scale from struct tcpcb.


# 3529149e 06-May-2007 Andre Oppermann <andre@FreeBSD.org>

Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of
a decdicated sack_enable int for this bool. Change all users accordingly.


# 6087c3c2 04-May-2007 Robert Watson <rwatson@FreeBSD.org>

Add global mutex tcp_debug_mtx, which will protect global TCP debugging
state tcp_debug, tcp_debx. Acquire and drop as required in tcp_trace().

Move to ANSI C function header, correct prototype types so that short TCP
state is no longer promoted to int unnecessarily.

Add comments.

MFC after: 3 weeks


# df47e437 20-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

o Remove unncessary TOF_SIGLEN flag from struct tcpopt
o Correctly set to->to_signature in tcp_dooptions()
o Update comments


# 4d6e7130 20-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Remove bogus check for accept queue length and associated failure handling
from the incoming SYN handling section of tcp_input().

Enforcement of the accept queue limits is done by sonewconn() after the
3WHS is completed. It is not necessary to have an earlier check before a
connection request enters the SYN cache awaiting the full handshake. It
rather limits the effectiveness of the syncache by preventing legit and
illegit connections from entering it and having them shaken out before we
hit the real limit which may have vanished by then.

Change return value of syncache_add() to void. No status communication
is required.


# b8152ba7 11-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Change the TCP timer system from using the callout system five times
directly to a merged model where only one callout, the next to fire,
is registered.

Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.

The single new callout is a mutex callout on inpcb simplifying the
locking a bit.

tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.

Reviewed by: rwatson (earlier version)


# e406f5a1 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Remove tcp_minmssoverload DoS detection logic. The problem it tried to
protect us from wasn't really there and it only bloats the code. Should
the problem surface in the future we can simply resurrect it from cvs
history.


# 02a1a643 15-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Consolidate insertion of TCP options into a segment from within tcp_output()
and syncache_respond() into its own generic function tcp_addoptions().

tcp_addoptions() is alignment agnostic and does optimal packing in all cases.

In struct tcpopt rename to_requested_s_scale to just to_wscale.

Add a comment with quote from RFC1323: "The Window field in a SYN (i.e.,
a <SYN> or <SYN,ACK>) segment itself is never scaled."

Reviewed by: silby, mohans, julian
Sponsored by: TCP/IP Optimization Fundraise 2005


# 7c72af87 26-Feb-2007 Mohan Srinivasan <mohans@FreeBSD.org>

Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigate
potential issues where the peer does not close, potentially leaving
thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl
fast_finwait2_recycle, which is disabled by default.

Reviewed by: gnn, silby.


# 6741ecf5 01-Feb-2007 Andre Oppermann <andre@FreeBSD.org>

Auto sizing TCP socket buffers.

Normally the socket buffers are static (either derived from global
defaults or set with setsockopt) and do not adapt to real network
conditions. Two things happen: a) your socket buffers are too small
and you can't reach the full potential of the network between both
hosts; b) your socket buffers are too big and you waste a lot of
kernel memory for data just sitting around.

With automatic TCP send and receive socket buffers we can start with a
small buffer and quickly grow it in parallel with the TCP congestion
window to match real network conditions.

FreeBSD has a default 32K send socket buffer. This supports a maximal
transfer rate of only slightly more than 2Mbit/s on a 100ms RTT
trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send
buffer auto scaling and the default values below it supports 20Mbit/s
at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or
1000%. For the receive side it looks slightly better with a default of
64K buffer size.

New sysctls are:
net.inet.tcp.sendbuf_auto=1 (enabled)
net.inet.tcp.sendbuf_inc=8192 (8K, step size)
net.inet.tcp.sendbuf_max=262144 (256K, growth limit)
net.inet.tcp.recvbuf_auto=1 (enabled)
net.inet.tcp.recvbuf_inc=16384 (16K, step size)
net.inet.tcp.recvbuf_max=262144 (256K, growth limit)

Tested by: many (on HEAD and RELENG_6)
Approved by: re
MFC after: 1 month


# bf6d304a 13-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

Rewrite of TCP syncookies to remove locking requirements and to enhance
functionality:

- Remove a rwlock aquisition/release per generated syncookie. Locking
is now integrated with the bucket row locking of syncache itself and
syncookies no longer add any additional lock overhead.
- Syncookie secrets are different for and stored per syncache buck row.
Secrets expire after 16 seconds and are reseeded on-demand.
- The computational overhead for syncookie generation and verification
is one MD5 hash computation as before.
- Syncache can be turned off and run with syncookies only by setting the
sysctl net.inet.tcp.syncookies_only=1.

This implementation extends the orginal idea and first implementation
of FreeBSD by using not only the initial sequence number field to store
information but also the timestamp field if present. This way we can
keep track of the entire state we need to know to recreate the session in
its original form. Almost all TCP speakers implement RFC1323 timestamps
these days. For those that do not we still have to live with the known
shortcomings of the ISN only SYN cookies. The use of the timestamp field
causes the timestamps to be randomized if syncookies are enabled.

The idea of SYN cookies is to encode and include all necessary information
about the connection setup state within the SYN-ACK we send back and thus
to get along without keeping any local state until the ACK to the SYN-ACK
arrives (if ever). Everything we need to know should be available from
the information we encoded in the SYN-ACK.

A detailed description of the inner working of the syncookies mechanism
is included in the comments in tcp_syncache.c.

Reviewed by: silby (slightly earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005


# 751dea29 07-Sep-2006 Ruslan Ermilov <ru@FreeBSD.org>

Back when we had T/TCP support, we used to apply different
timeouts for TCP and T/TCP connections in the TIME_WAIT
state, and we had two separate timed wait queues for them.
Now that is has gone, the timeout is always 2*MSL again,
and there is no reason to keep two queues (the first was
unused anyway!).

Also, reimplement the remaining queue using a TAILQ (it
was technically impossible before, with two queues).


# 233dcce1 06-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

First step of TSO (TCP segmentation offload) support in our network stack.

o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
o add CSUM_TSO flag to mbuf pkthdr csum_flags field
o add tso_segsz field to mbuf pkthdr
o enhance ip_output() packet length check to allow for large TSO packets
o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
o adjust all callers of tcp_maxmtu[46]() accordingly

Discussed on: -current, -net
Sponsored by: TCP/IP Optimization Fundraise 2005


# 2c857a9b 06-Sep-2006 Gleb Smirnoff <glebius@FreeBSD.org>

o Backout rev. 1.125 of in_pcb.c. It appeared to behave extremely
bad under high load. For example with 40k sockets and 25k tcptw
entries, connect() syscall can run for seconds. Debugging showed
that it iterates the cycle millions times and purges thousands of
tcptw entries at a time.
Besides practical unusability this change is architecturally
wrong. First, in_pcblookup_local() is used in connect() and bind()
syscalls. No stale entries purging shouldn't be done here. Second,
it is a layering violation.
o Return back the tcptw purging cycle to tcp_timer_2msl_tw(),
that was removed in rev. 1.78 by rwatson. The commit log of this
revision tells nothing about the reason cycle was removed. Now
we need this cycle, since major cleaner of stale tcptw structures
is removed.
o Disable probably necessary, but now unused
tcp_twrecycleable() function.

Reviewed by: ru


# f72167f4 26-Jun-2006 Andre Oppermann <andre@FreeBSD.org>

Some cleanups and janitorial work to tcp_dooptions():

o redefine the parameter 'is_syn' to 'flags', add TO_SYN flag and adjust its
usage accordingly
o update the comments to the tcp_dooptions() invocation in
tcp_input():after_listen to reflect reality
o move the logic checking the echoed timestamp out of tcp_dooptions() to the
only place that uses it next to the invocation described in the previous
item
o adjust parsing of TCPOPT_SACK_PERMITTED to use the same style as the others
o add comments in to struct tcpopt.to_flags #defines

No functional changes.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 8411d000 17-Jun-2006 Andre Oppermann <andre@FreeBSD.org>

Move all syncache related structures to tcp_syncache.c. They are only used
there.

This unbreaks userland programs that include tcp_var.h.

Discussed with: rwatson


# 93f0d0c5 17-Jun-2006 Andre Oppermann <andre@FreeBSD.org>

Rearrange fields in struct syncache and syncache_head to make them more
cache line friendly.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 351630c4 17-Jun-2006 Andre Oppermann <andre@FreeBSD.org>

Add locking to TCP syncache and drop the global tcpinfo lock as early
as possible for the syncache_add() case. The syncache timer no longer
aquires the tcpinfo lock and timeout/retransmit runs can happen in
parallel with bucket granularity.

On a P4 the additional locks cause a slight degression of 0.7% in tcp
connections per second. When IP and TCP input are deserialized and
can run in parallel this little overhead can be neglected. The syncookie
handling still leaves room for improvement and its random salts may be
moved to the syncache bucket head structures to remove the second lock
operation currently required for it. However this would be a more
involved change from the way syncookies work at the moment.

Reviewed by: rwatson
Tested by: rwatson, ps (earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005


# 623dce13 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.

- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.

- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.

- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.

- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true of a inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires is to further address how to
handle the timer shutdown shutdown issue.

- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.

- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it lead to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.

- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.

These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.

MFC after: 3 months


# 464fcfbc 28-Feb-2006 Andre Oppermann <andre@FreeBSD.org>

Rework TCP window scaling (RFC1323) to properly scale the send window
right from the beginning and partly clean up the differences in handling
between SYN_SENT and SYN_RCVD (syncache).

Further changes to this code to come. This is a first incremental step
to a general overhaul and streamlining of the TCP code.

PR: kern/15095
PR: kern/92690 (partly)
Reviewed by: qingli (and tested with ANVL)
Sponsored by: TCP/IP Optimization Fundraise 2005


# eaf80179 16-Feb-2006 Andre Oppermann <andre@FreeBSD.org>

Have TCP Inflight disable itself if the RTT is below a certain
threshold. Inflight doesn't make sense on a LAN as it has
trouble figuring out the maximal bandwidth because of the coarse
tick granularity.

The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold
in milliseconds below which inflight will disengage. It defaults
to 10ms.

Tested by: Joao Barros <joao.barros-at-gmail.com>,
Rich Murphey <rich-at-whiteoaklabs.com>
Sponsored by: TCP/IP Optimization Fundraise 2005


# 5a53ca16 27-Jun-2005 Paul Saab <ps@FreeBSD.org>

- Postpone SACK option processing until after PAWS checks. SACK option
processing is now done in the ACK processing case.
- Merge tcp_sack_option() and tcp_del_sackholes() into a new function
called tcp_sack_doack().
- Test (SEG.ACK < SND.MAX) before processing the ACK.

Submitted by: Noritoshi Demizu
Reveiewed by: Mohan Srinivasan, Raja Mukerji
Approved by: re


# 9d17a7a6 04-Jun-2005 Paul Saab <ps@FreeBSD.org>

Changes to tcp_sack_option() that
- Walks the scoreboard backwards from the tail to reduce the number of
comparisons for each sack option received.
- Introduce functions to add/remove sack scoreboard elements, making
the code more readable.

Submitted by: Noritoshi Demizu
Reviewed by: Raja Mukerji, Mohan Srinivasan


# 808f11b7 25-May-2005 Paul Saab <ps@FreeBSD.org>

This is conform with the terminology in

M.Mathis and J.Mahdavi,
"Forward Acknowledgement: Refining TCP Congestion Control"
SIGCOMM'96, August 1996.

Submitted by: Noritoshi Demizu, Raja Mukerji


# 2cdbfa66 20-May-2005 Paul Saab <ps@FreeBSD.org>

Replace t_force with a t_flag (TF_FORCEDATA).

Submitted by: Raja Mukerji.
Reviewed by: Mohan, Silby, Andre Opperman.


# 0077b016 11-May-2005 Paul Saab <ps@FreeBSD.org>

When looking for the next hole to retransmit from the scoreboard,
or to compute the total retransmitted bytes in this sack recovery
episode, the scoreboard is traversed. While in sack recovery, this
traversal occurs on every call to tcp_output(), every dupack and
every partial ack. The scoreboard could potentially get quite large,
making this traversal expensive.

This change optimizes this by storing hints (for the next hole to
retransmit and the total retransmitted bytes in this sack recovery
episode) reducing the complexity to find these values from O(n) to
constant time.

The debug code that sanity checks the hints against the computed
value will be removed eventually.

Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.


# a6235da6 21-Apr-2005 Paul Saab <ps@FreeBSD.org>

- Make the sack scoreboard logic use the TAILQ macros. This improves
code readability and facilitates some anticipated optimizations in
tcp_sack_option().
- Remove tcp_print_holes() and TCP_SACK_DEBUG.

Submitted by: Raja Mukerji.
Reviewed by: Mohan Srinivasan, Noritoshi Demizu.


# 1600372b 20-Apr-2005 Andre Oppermann <andre@FreeBSD.org>

Ignore ICMP Source Quench messages for TCP sessions. Source Quench is
ineffective, depreciated and can be abused to degrade the performance
of active TCP sessions if spoofed.

Replace a bogus call to tcp_quench() in tcp_output() with the direct
equivalent tcpcb variable assignment.

Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.1
MFC after: 3 days


# 25e6f9ed 14-Apr-2005 Paul Saab <ps@FreeBSD.org>

Fix for a TCP SACK bug where more than (win/2) bytes could have been
in flight in SACK recovery.

Found by: Noritoshi Demizu
Submitted by: Mohan Srinivasan <mohans at yahoo-inc dot com>
Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>
Raja Mukerji <raja at moselle dot com>


# e891d82b 09-Mar-2005 Paul Saab <ps@FreeBSD.org>

Add limits on the number of elements in the sack scoreboard both
per-connection and globally. This eliminates potential DoS attacks
where SACK scoreboard elements tie up too much memory.

Submitted by: Raja Mukerji (raja at moselle dot com).
Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com).


# 7643c37c 17-Feb-2005 Paul Saab <ps@FreeBSD.org>

Remove 2 (SACK) fields from the tcpcb. These are only used by a
function that is called from tcp_input(), so they oughta be passed on
the stack instead of stuck in the tcpcb.

Submitted by: Mohan Srinivasan


# 9945c0e2 14-Feb-2005 Maxim Konovalov <maxim@FreeBSD.org>

o Add handling of an IPv4-mapped IPv6 address.
o Use SYSCTL_IN() macro instead of direct call of copyin(9).

Submitted by: ume

o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where
most of tcp sysctls live.
o There are net.inet[6].tcp[6].getcred sysctls already, no needs in
a separate struct tcp_ident_mapping.

Suggested by: ume


# 212a79b0 06-Feb-2005 Maxim Konovalov <maxim@FreeBSD.org>

o Implement net.inet.tcp.drop sysctl and userland part, tcpdrop(8)
utility:

The tcpdrop command drops the TCP connection specified by the
local address laddr, port lport and the foreign address faddr,
port fport.

Obtained from: OpenBSD
Reviewed by: rwatson (locking), ru (man page), -current
MFC after: 1 month


# c398230b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# db0aae38 22-Dec-2004 Robert Watson <rwatson@FreeBSD.org>

Remove the now unused tcp_canceltimers() function. tcpcb timers are
now stopped as part of tcp_discardcb().

MFC after: 2 weeks


# c94c54e4 02-Nov-2004 Andre Oppermann <andre@FreeBSD.org>

Remove RFC1644 T/TCP support from the TCP side of the network stack.

A complete rationale and discussion is given in this message
and the resulting discussion:

http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel. Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols) and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.

Discussed on: -arch


# cd109b0d 22-Oct-2004 Andre Oppermann <andre@FreeBSD.org>

Shave 40 unused bytes from struct tcpcb.


# a55db2b6 05-Oct-2004 Paul Saab <ps@FreeBSD.org>

- Estimate the amount of data in flight in sack recovery and use it
to control the packets injected while in sack recovery (for both
retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
number of sack retransmissions done upon initiation of sack recovery.

Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>


# a4f757cd 16-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

White space cleanup for netinet before branch:

- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by: re (scottl)
Submitted by: Xin LI <delphij@frontfree.net>


# 969860f3 17-Jul-2004 David Malone <dwmalone@FreeBSD.org>

The tcp syncache code was leaving the IPv6 flowlabel uninitialised
for the SYN|ACK packet and then letting in6_pcbconnect set the
flowlabel later. Arange for the syncache/syncookie code to set and
recall the flow label so that the flowlabel used for the SYN|ACK
is consistent. This is done by using some of the cookie (when tcp
cookies are enabeled) and by stashing the flowlabel in syncache.

Tested and Discovered by: Orla McGann <orly@cnri.dit.ie>
Approved by: ume, silby
MFC after: 1 month


# 37332f04 24-Jun-2004 Bruce M Simpson <bms@FreeBSD.org>

Whitespace.


# 6d90faf3 23-Jun-2004 Paul Saab <ps@FreeBSD.org>

Add support for TCP Selective Acknowledgements. The work for this
originated on RELENG_4 and was ported to -CURRENT.

The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.

You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.

Reviewed by: gnn
Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)


# 80dd2a81 25-Apr-2004 Mike Silbersack <silby@FreeBSD.org>

Tighten up reset handling in order to make reset attacks as difficult as
possible while maintaining compatibility with the widest range of TCP stacks.

The algorithm is as follows:

---
For connections in the ESTABLISHED state, only resets with
sequence numbers exactly matching last_ack_sent will cause a reset,
all other segments will be silently dropped.

For connections in all other states, a reset anywhere in the window
will cause the connection to be reset. All other segments will be
silently dropped.
---

The necessity of accepting all in-window resets was discovered
by jayanth and jlemon, both of whom have seen TCP stacks that
will respond to FIN-ACK packets with resets not meeting the
strict last_ack_sent check.

Idea by: Darren Reed
Reviewed by: truckman, jlemon, others(?)


# de9f59f8 20-Apr-2004 Bruce M Simpson <bms@FreeBSD.org>

Fix a typo in a comment.


# c1537ef0 20-Apr-2004 Mike Silbersack <silby@FreeBSD.org>

Enhance our RFC1948 implementation to perform better in some pathlogical
TIME_WAIT recycling cases I was able to generate with http testing tools.

In short, as the old algorithm relied on ticks to create the time offset
component of an ISN, two connections with the exact same host, port pair
that were generated between timer ticks would have the exact same sequence
number. As a result, the second connection would fail to pass the TIME_WAIT
check on the server side, and the SYN would never be acknowledged.

I've "fixed" this by adding random positive increments to the time component
between clock ticks so that ISNs will *always* be increasing, no matter how
quickly the port is recycled.

Except in such contrived benchmarking situations, this problem should never
come up in normal usage... until networks get faster.

No MFC planned, 4.x is missing other optimizations that are needed to even
create the situation in which such quick port recycling will occur.


# f36cfd49 07-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# a7b6a14a 28-Feb-2004 Robert Watson <rwatson@FreeBSD.org>

Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These
were needed by the MAC Framework until inpcbs gained labels.

Submitted by: sam


# 0613995b 25-Feb-2004 Bruce Evans <bde@FreeBSD.org>

Fixed namespace pollution in rev.1.74. Implementation of the syncache
increased <netinet/tcp_var>'s already large set of prerequisites, and
this was handled badly. Just don't declare the complete syncache struct
unless <netinet/pcb.h> is included before <netinet/tcp_var.h>.

Approved by: jlemon (years ago, for a more invasive fix)


# a545b1dc 25-Feb-2004 Bruce Evans <bde@FreeBSD.org>

Don't use the negatively-opaque type uma_zone_t or be chummy with
<vm/uma.h>'s idempotency indentifier or its misspelling.


# 12e2e970 24-Feb-2004 Andre Oppermann <andre@FreeBSD.org>

Convert the tcp segment reassembly queue to UMA and limit the maximum
amount of segments it will hold.

The following tuneables and sysctls control the behaviour of the tcp
segment reassembly queue:

net.inet.tcp.reass.maxsegments (loader tuneable)
specifies the maximum number of segments all tcp reassemly queues can
hold (defaults to 1/16 of nmbclusters).

net.inet.tcp.reass.maxqlen
specifies the maximum number of segments any individual tcp session queue
can hold (defaults to 48).

net.inet.tcp.reass.cursegments (readonly)
counts the number of segments currently in all reassembly queues.

net.inet.tcp.reass.overflows (readonly)
counts how often either the global or local queue limit has been reached.

Tested by: bms, silby
Reviewed by: bms, silby


# 265ed012 13-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Brucification.

Submitted by: bde


# b30190b5 12-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Update the prototype for tcpsignature_apply() to reflect the spelling of
the types used by m_apply()'s callback function, f, as documented in mbuf(9).

Noticed by: njl


# 1cfd4b53 10-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Initial import of RFC 2385 (TCP-MD5) digest support.

This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by: sentex.net


# 53369ac9 08-Jan-2004 Andre Oppermann <andre@FreeBSD.org>

Limiters and sanity checks for TCP MSS (maximum segement size)
resource exhaustion attacks.

For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU. This is done
dynamically based on feedback from the remote host and network
components along the packet path. This information can be
abused to pretend an extremely low path MTU.

The resource exhaustion works in two ways:

o during tcp connection setup the advertized local MSS is
exchanged between the endpoints. The remote endpoint can
set this arbitrarily low (except for a minimum MTU of 64
octets enforced in the BSD code). When the local host is
sending data it is forced to send many small IP packets
instead of a large one.

For example instead of the normal TCP payload size of 1448
it forces TCP payload size of 12 (MTU 64) and thus we have
a 120 times increase in workload and packets. On fast links
this quickly saturates the local CPU and may also hit pps
processing limites of network components along the path.

This type of attack is particularly effective for servers
where the attacker can download large files (WWW and FTP).

We mitigate it by enforcing a minimum MTU settable by sysctl
net.inet.tcp.minmss defaulting to 256 octets.

o the local host is reveiving data on a TCP connection from
the remote host. The local host has no control over the
packet size the remote host is sending. The remote host
may chose to do what is described in the first attack and
send the data in packets with an TCP payload of at least
one byte. For each packet the tcp_input() function will
be entered, the packet is processed and a sowakeup() is
signalled to the connected process.

For example an attack with 2 Mbit/s gives 4716 packets per
second and the same amount of sowakeup()s to the process
(and context switches).

This type of attack is particularly effective for servers
where the attacker can upload large amounts of data.
Normally this is the case with WWW server where large POSTs
can be made.

We mitigate this by calculating the average MSS payload per
second. If it goes below 'net.inet.tcp.minmss' and the pps
rate is above 'net.inet.tcp.minmssoverload' defaulting to
1000 this particular TCP connection is resetted and dropped.

MITRE CVE: CAN-2004-0002
Reviewed by: sam (mentor)
MFC after: 1 day


# 97d8d152 20-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# 4bd4fa3f 02-Nov-2003 Mike Silbersack <silby@FreeBSD.org>

Add an additional check to the tcp_twrecycleable function; I had
previously only considered the send sequence space. Unfortunately,
some OSes (windows) still use a random positive increments scheme for
their syn-ack ISNs, so I must consider receive sequence space as well.

The value of 250000 bytes / second for Microsoft's ISN rate of increase
was determined by testing with an XP machine.


# 96af9ea5 01-Nov-2003 Mike Silbersack <silby@FreeBSD.org>

- Add a new function tcp_twrecycleable, which tells us if the ISN which
we will generate for a given ip/port tuple has advanced far enough
for the time_wait socket in question to be safely recycled.

- Have in_pcblookup_local use tcp_twrecycleable to determine if
time_Wait sockets which are hogging local ports can be safely
freed.

This change preserves proper TIME_WAIT behavior under normal
circumstances while allowing for safe and fast recycling whenever
ephemeral port space is scarce.


# 9d11646d 15-Jul-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Unify the "send high" and "recover" variables as specified in the
lastest rev of the spec. Use an explicit flag for Fast Recovery. [1]

Fix bug with exiting Fast Recovery on a retransmit timeout
diagnosed by Lu Guohan. [2]

Reviewed by: Thomas Henderson <thomas.r.henderson@boeing.com>
Reported and tested by: Lu Guohan <lguohan00@mails.tsinghua.edu.cn> [2]
Approved by: Thomas Henderson <thomas.r.henderson@boeing.com>,
Sally Floyd <floyd@acm.org> [1]


# 430c6354 06-May-2003 Robert Watson <rwatson@FreeBSD.org>

Correct a bug introduced with reduced TCP state handling; make
sure that the MAC label on TCP responses during TIMEWAIT is
properly set from either the socket (if available), or the mbuf
that it's responding to.

Unfortunately, this is made somewhat difficult by the TCP code,
as tcp_twstart() calls tcp_twrespond() after discarding the socket
but without a reference to the mbuf that causes the "response".
Passing both the socket and the mbuf works arounds this--eventually
it might be good to make sure the mbuf always gets passed in in
"response" scenarios but working through this provided to
complicate things too much.

Approved by: re (scottl)
Reviewed by: hsu
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 48d2549c 01-Apr-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Observe conservation of packets when entering Fast Recovery while
doing Limited Transmit. Only artificially inflate the congestion
window by 1 segment instead of the usual 3 to take into account
the 2 already sent by Limited Transmit.

Approved in principle by: Mark Allman <mallman@grc.nasa.gov>,
Hari Balakrishnan <hari@nms.lcs.mit.edu>, Sally Floyd <floyd@icir.org>


# 607b0b0c 08-Mar-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Remove a panic(); if the zone allocator can't provide more timewait
structures, reuse the oldest one. Also move the expiry timer from
a per-structure callout to the tcp slow timer.

Sponsored by: DARPA, NAI Labs


# 340c35de 19-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Add a TCP TIMEWAIT state which uses less space than a fullblown TCP
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.

Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs


# 79909384 19-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so the
routine does not require a tcpcb to operate. Since we no longer keep
template mbufs around, move pseudo checksum out of this routine, and
merge it with the length update.

Sponsored by: DARPA, NAI Labs


# cb942153 13-Jan-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Fix NewReno.

Reviewed by: Tom Henderson <thomas.r.henderson@boeing.com>


# 1fcc99b5 17-Aug-2002 Matthew Dillon <dillon@FreeBSD.org>

Implement TCP bandwidth delay product window limiting, similar to (but
not meant to duplicate) TCP/Vegas. Add four sysctls and default the
implementation to 'off'.

net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off)
net.inet.tcp.inflight_debug debugging (defaults to 1=on)
net.inet.tcp.inflight_min minimum window limit
net.inet.tcp.inflight_max maximum window limit

MFC after: 1 week


# d65bf08a 19-Jul-2002 Matthew Dillon <dillon@FreeBSD.org>

Add the tcps_sndrexmitbad statistic, keep track of late acks that caused
unnecessary retransmissions.


# 3ce144ea 14-Jun-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Notify functions can destroy the pcb, so they have to return an
indication of whether this happenned so the calling function
knows whether or not to unlock the pcb.

Submitted by: Jennifer Yang (yangjihui@yahoo.com)
Bug reported by: Sid Carter (sidcarter@symonds.net)


# eb5afeba 13-Jun-2002 Mike Silbersack <silby@FreeBSD.org>

Re-commit w/fix:

Ensure that the syn cache's syn-ack packets contain the same
ip_tos, ip_ttl, and DF bits as all other tcp packets.

PR: 39141
MFC after: 2 weeks

This time, make sure that ipv4 specific code (aka all of the above)
is only run in the ipv4 case.


# 70d2b170 13-Jun-2002 Mike Silbersack <silby@FreeBSD.org>

Back out ip_tos/ip_ttl/DF "fix", it just panic'd my box. :)

Pointy-hat to: silby


# 21c3b2fc 13-Jun-2002 Mike Silbersack <silby@FreeBSD.org>

Ensure that the syn cache's syn-ack packets contain the same
ip_tos, ip_ttl, and DF bits as all other tcp packets.

PR: 39141
MFC after: 2 weeks


# f76fcf6d 10-Jun-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Lock up inpcb.

Submitted by: Jennifer Yang <yangjihui@yahoo.com>


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# 262c1c1a 02-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

Fix a bug with transmitter restart after receiving a 0 window. The
receiver was not sending an immediate ack with delayed acks turned on
when the input buffer is drained, preventing the transmitter from
restarting immediately.

Propogate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and
is a good idea anyway).

Some cleanup. Identify additonal issues in comments.

MFC after: 1 day


# be2ac88c 21-Nov-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce a syncache, which enables FreeBSD to withstand a SYN flood
DoS in an improved fashion over the existing code.

Reviewed by: silby (in a previous iteration)
Sponsored by: DARPA, NAI Labs


# c24d5dae 05-Oct-2001 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

Add a flag TF_LASTIDLE, that forces a previously idle connection
to send all its data, especially when the data is less than one MSS.
This fixes an issue where the stack was delaying the sending
of data, eventhough there was enough window to send all the data and
the sending of data was emptying the socket buffer.

Problem found by Yoshihiro Tsuchiya (tsuchiya@flab.fujitsu.co.jp)

Submitted by: Jayanth Vijayaraghavan


# f0ffb944 03-Sep-2001 Julian Elischer <julian@FreeBSD.org>

Patches from Keiichi SHIMA <keiichi@iij.ad.jp>
to make ip use the standard protosw structure again.

Obtained from: Well, KAME I guess.


# b0e3ad75 21-Aug-2001 Mike Silbersack <silby@FreeBSD.org>

Much delayed but now present: RFC 1948 style sequence numbers

In order to ensure security and functionality, RFC 1948 style
initial sequence number generation has been implemented. Barring
any major crypographic breakthroughs, this algorithm should be
unbreakable. In addition, the problems with TIME_WAIT recycling
which affect our currently used algorithm are not present.

Reviewed by: jesper


# 2d610a50 07-Jul-2001 Mike Silbersack <silby@FreeBSD.org>

Temporary feature: Runtime tuneable tcp initial sequence number
generation scheme. Users may now select between the currently used
OpenBSD algorithm and the older random positive increment method.

While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT
handling; this is causing trouble for an increasing number of folks.

To switch between generation schemes, one sets the sysctl
net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments,
1 = the OpenBSD algorithm. 1 is still the default.

Once a secure _and_ compatible algorithm is implemented, this sysctl
will be removed.

Reviewed by: jlemon
Tested by: numerous subscribers of -net


# 08517d53 22-Jun-2001 Mike Silbersack <silby@FreeBSD.org>

Eliminate the allocation of a tcp template structure for each
connection. The information contained in a tcptemp can be
reconstructed from a tcpcb when needed.

Previously, tcp templates required the allocation of one
mbuf per connection. On large systems, this change should
free up a large number of mbufs.

Reviewed by: bmilekic, jlemon, ru
MFC after: 2 weeks


# f0a04f3f 17-Apr-2001 Kris Kennaway <kris@FreeBSD.org>

Randomize the TCP initial sequence numbers more thoroughly.

Obtained from: OpenBSD
Reviewed by: jesper, peter, -developers


# c693a045 26-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Remove in_pcbnotify and use in_pcblookup_hash to find the cb directly.

For TCP, verify that the sequence number in the ICMP packet falls within
the tcp receive window before performing any actions indicated by the
icmp packet.

Clean up some layering violations (access to tcp internals from in_pcb)


# 694a9ff9 25-Feb-2001 Jesper Skriver <jesper@FreeBSD.org>

Remove tcp_drop_all_states, which is unneeded after jlemon removed it
from tcp_subr.c in rev 1.92


# 90fcbbd6 18-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unneeded loop increment in src/sys/netinet/in_pcb.c:in_pcbnotify

Add new PRC_UNREACH_ADMIN_PROHIB in sys/sys/protosw.h

Remove condition on TCP in src/sys/netinet/ip_icmp.c:icmp_input

In src/sys/netinet/ip_icmp.c:icmp_input set code = PRC_UNREACH_ADMIN_PROHIB
or PRC_UNREACH_HOST for all unreachables except ICMP_UNREACH_NEEDFRAG

Rename sysctl icmp_admin_prohib_like_rst to icmp_unreach_like_rst
to reflect the fact that we also react on ICMP unreachables that
are not administrative prohibited. Also update the comments to
reflect this.

In sys/netinet/tcp_subr.c:tcp_ctlinput add code to treat
PRC_UNREACH_ADMIN_PROHIB and PRC_UNREACH_HOST different.

PR: 23986
Submitted by: Jesper Skriver <jesper@skriver.dk>


# 442fad67 24-Dec-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Update the "icmp_admin_prohib_like_rst" code to check the tcp-window and
to be configurable with respect to acting only in SYN or in all TCP states.

PR: 23665
Submitted by: Jesper Skriver <jesper@skriver.dk>


# b11d7a4a 16-Dec-2000 Poul-Henning Kamp <phk@FreeBSD.org>

We currently does not react to ICMP administratively prohibited
messages send by routers when they deny our traffic, this causes
a timeout when trying to connect to TCP ports/services on a remote
host, which is blocked by routers or firewalls.

rfc1122 (Requirements for Internet Hosts) section 3.2.2.1 actually
requi re that we treat such a message for a TCP session, that we
treat it like if we had recieved a RST.

quote begin.

A Destination Unreachable message that is received MUST be
reported to the transport layer. The transport layer SHOULD
use the information appropriately; for example, see Sections
4.1.3.3, 4.2.3.9, and 4.2.4 below. A transport protocol
that has its own mechanism for notifying the sender that a
port is unreachable (e.g., TCP, which sends RST segments)
MUST nevertheless accept an ICMP Port Unreachable for the
same purpose.

quote end.

I've written a small extension that implement this, it also create
a sysctl "net.inet.tcp.icmp_admin_prohib_like_rst" to control if
this new behaviour is activated.

When it's activated (set to 1) we'll treat a ICMP administratively
prohibited message (icmp type 3 code 9, 10 and 13) for a TCP
sessions, as if we recived a TCP RST, but only if the TCP session
is in SYN_SENT state.

The reason for only reacting when in SYN_SENT state, is that this
will solve the problem, and at the same time minimize the risk of
this being abused.

I suggest that we enable this new behaviour by default, but it
would be a change of current behaviour, so if people prefer to
leave it disabled by default, at least for now, this would be ok
for me, the attached diff actually have the sysctl set to 0 by
default.

PR: 23086
Submitted by: Jesper Skriver <jesper@skriver.dk>


# e7f32693 21-Jul-2000 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

When a connection is being dropped due to a listen queue overflow,
delete the cloned route that is associated with the connection.
This does not exhaust the routing table memory when the system
is under a SYN flood attack. The route entry is not deleted if there
is any prior information cached in it.

Reviewed by: Peter Wemm,asmodai


# 571214d4 18-Jul-2000 Sheldon Hearn <sheldonh@FreeBSD.org>

Fix a comment which was broken in rev 1.36.

PR: 19947
Submitted by: Tetsuya Isaki <isaki@net.ipc.hiroshima-u.ac.jp>


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 46f58482 05-May-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Implement TCP NewReno, as documented in RFC 2582. This allows
better recovery for multiple packet losses in a single window.
The algorithm can be toggled via the sysctl net.inet.tcp.newreno,
which defaults to "on".

Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>


# fb59c426 09-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

tcp updates to support IPv6.
also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 664a31e4 28-Dec-1999 Peter Wemm <peter@FreeBSD.org>

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.


# 6a800098 22-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

IPSEC support in the kernel.
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 76429de4 05-Nov-1999 Yoshinobu Inoue <shin@FreeBSD.org>

KAME related header files additions and merges.
(only those which don't affect c source files so much)

Reviewed by: cvs-committers
Obtained from: KAME project


# 9b8b58e0 30-Aug-1999 Jonathan Lemon <jlemon@FreeBSD.org>

Restructure TCP timeout handling:

- eliminate the fast/slow timeout lists for TCP and instead use a
callout entry for each timer.
- increase the TCP timer granularity to HZ
- implement "bad retransmit" recovery, as presented in
"On Estimating End-to-End Network Path Properties", by Allman and Paxson.

Submitted by: jlemon, wollmann


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# ce02431f 16-Feb-1999 Doug Rabson <dfr@FreeBSD.org>

* Change sysctl from using linker_set to construct its tree using SLISTs.
This makes it possible to change the sysctl tree at runtime.

* Change KLD to find and register any sysctl nodes contained in the loaded
file and to unregister them when the file is unloaded.

Reviewed by: Archie Cobbs <archie@whistle.com>,
Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)


# b0acefa8 20-Jan-1999 Bill Fenner <fenner@FreeBSD.org>

Add a flag, passed to pru_send routines, PRUS_MORETOCOME. This
flag means that there is more data to be put into the socket buffer.
Use it in TCP to reduce the interaction between mbuf sizes and the
Nagle algorithm.

Based on: "Justin C. Walker" <justin@apple.com>'s description of Apple's
fix for this problem.


# 6effc713 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Re-implement tcp and ip fragment reassembly to not store pointers in the
ip header which can't work on alpha since pointers are too big.

Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>


# cfe8b629 22-Aug-1998 Garrett Wollman <wollman@FreeBSD.org>

Yow! Completely change the way socket options are handled, eliminating
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.


# 07a4df4f 13-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Declare tcp_seq and tcp_cc as fixed-size types. Half fixed type
mismatches exposed by this (the prototype for tcp_respond() didn't
match the function definition lexically, and still depends on a
gcc feature to match if ints have more than 32 bits).


# a910fdcb 27-Jun-1998 John Hay <jhay@FreeBSD.org>

Only make struct xtcpcb visable if _NETINET_IN_PCB_H_ and _SYS_SOCKETVAR_H_
are defined.
Reviewed by: bde


# 98271db4 15-May-1998 Garrett Wollman <wollman@FreeBSD.org>

Convert socket structures to be type-stable and add a version number.

Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.

Convert PF_LOCAL PCB structures to be type-stable and add a version number.

Define an external format for infomation about socket structures and use
it in several places.

Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.

It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).


# 552b7df4 24-Apr-1998 David Greenman <dg@FreeBSD.org>

Ensure that TCP_REXMTVAL doesn't return a value less than t_rttmin. This
is believed to have been broken with the Brakmo/Peterson srtt
calculation changes. The result of this bug is that TCP connections
could time out extremely quickly (in 12 seconds).
Also backed out jdp's partial fix for this problem in rev 1.17 of
tcp_timer.c as it is obsoleted by this commit.
Bug was pointed out by Kevin Lehey <kml@roller.nas.nasa.gov>.

PR: 6068


# 8e5db87c 06-Apr-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the last traces of TUBA.

Inspired by: PR kern/3317


# f498eeee 25-Feb-1998 David Greenman <dg@FreeBSD.org>

Changes to support the addition of a new sysctl variable:
net.inet.tcp.delack_enabled
Which defaults to 1 and can be set to 0 to disable TCP delayed-ack
processing (i.e. all acks are immediate).


# c3229e05 27-Jan-1998 David Greenman <dg@FreeBSD.org>

Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.

Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.

These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.

Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.

WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).


# a29f300e 27-Apr-1997 Garrett Wollman <wollman@FreeBSD.org>

The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 561c2ad3 13-Sep-1996 Paul Traina <pst@FreeBSD.org>

Move TCPCTL_KEEPINIT to end of MIB list (sigh)


# 7b40aa32 13-Sep-1996 Paul Traina <pst@FreeBSD.org>

Make the misnamed tcp initial keepalive timer value (which is really the
time, in seconds, that state for non-established TCP sessions stays about)
a sysctl modifyable variable.

[part 1 of two commits, I just realized I can't play with the indices as
I was typing this commit message.]


# 2c37256e 11-Jul-1996 Garrett Wollman <wollman@FreeBSD.org>

Modify the kernel to use the new pr_usrreqs interface rather than the old
pr_usrreq mechanism which was poorly designed and error-prone. This
commit renames pr_usrreq to pr_ousrreq so that old code which depended on it
would break in an obvious manner. This commit also implements the new
interface for TCP, although the old function is left as an example
(#ifdef'ed out). This commit ALSO fixes a longstanding bug in the
TCP timer processing (introduced by davidg on 1995/04/12) which caused
timer processing on a TCB to always stop after a single timer had
expired (because it misinterpreted the return value from tcp_usrreq()
to indicate that the TCB had been deleted). Finally, some code
related to polling has been deleted from if.c because it is not
relevant t -current and doesn't look at all like my current code.


# 6da5712b 05-Jun-1996 Garrett Wollman <wollman@FreeBSD.org>

Correct formula for TCP RTO calculation. Also try to do a better job in
filling in a new PCB's rttvar (but this is not the last word on the subject).
And get rid of `#ifdef RTV_RTT', it's been true for four years now...


# a2352fc1 26-Apr-1996 Garrett Wollman <wollman@FreeBSD.org>

Delete #ifdef notdef blocks containing old method of srtt calculation.

Requested by: davidg


# 233e8c18 22-Mar-1996 Garrett Wollman <wollman@FreeBSD.org>

A number of performance-reducing flaws fixed based on comments
from Larry Peterson &co. at Arizona:

- Header prediction for ACKs did not exclude Fast Retransmit/Recovery.
- srtt calculation tended to get ``stuck'' and could never decrease
when below 8. It still can't, but the scaling factors are adjusted
so that this artifact does not cause as bad an effect on the RTO
value as it used to.

The paper also points out the incr/8 error that has been long since fixed,
and the problems with ACKing frequency resulting from the use of options
which I suspect to be fixed already as well (as part of the T/TCP work).

Obtained from: Brakmo & Peterson, ``Performance Problems in BSD4.4 TCP''


# 3420f4ab 27-Feb-1996 Bruce Evans <bde@FreeBSD.org>

Spell tcp_listendrop consistently so that tcp_input.c and netstat compile.


# 1347f5b8 26-Feb-1996 Guido van Rooij <guido@FreeBSD.org>

Add a counter for the number of times the listen queue was overflowed to
the tcpstat structure. (netstat -s)
Reviewed by: wollman
Obtained from: Steves, TCP/IP Ill. vol.3, page 189


# 6c5e9bbd 30-Jan-1996 Mike Pritchard <mpp@FreeBSD.org>

Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.


# 34da5848 19-Jan-1996 Peter Wemm <peter@FreeBSD.org>

remove tcp_lastport - it has not been used for quite a while (at least
since the hashed pcb's I think).


# 858c045f 28-Dec-1995 David Greenman <dg@FreeBSD.org>

Remove some bogus externs.


# b62d102c 15-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Uniformized pr_ctlinput protosw functions. The third arg is now `void
*' instead of caddr_t and it isn't optional (it never was). Most of the
netipx (and netns) pr_ctlinput functions abuse the second arg instead of
using the third arg but fixing this is beyond the scope of this round
of changes.


# b7a44e34 05-Dec-1995 Garrett Wollman <wollman@FreeBSD.org>

Path MTU Discovery is now standard.


# 0312fbe9 14-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

New style sysctl & staticize alot of stuff.


# a45d2726 03-Nov-1995 Andras Olah <olah@FreeBSD.org>

Fix a logical error in T/TCP: when we actively open a connection, we
have to decide whether to send a CC or CCnew option in our SYN segment
depending on the contents of our TAO cache. This decision has to be
made once when the connection starts. The earlier code delayed this
decision until the segment was assembled in tcp_output() and
retransmitted SYN segments could have different CC options.

Reviewed by: Richard Stevens, davidg, wollman


# 3d1f141b 16-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

The ability to administratively change the MTU of an interface presents
a few new wrinkles for MTU discovery which tcp_output() had better
be prepared to handle. ip_output() is also modified to do something
helpful in this case, since it has already calculated the information
we need.


# 3abc79d2 12-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

The additional checks involving sequence numbers in MTU discovery resends
turned out not to be necessary; simply watching for MTU decreases (which
we already did) automagically eliminates all the cases we were trying to
protect against.


# 143d7a54 10-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

More MTU discovery: avoid over-retransmission if route changes in the
middle of a fully-open window. Also, keep track of how many retransmits
we do as a result of MTU discovery. This may actually do more work than
necessary, but it's an unusual condition...

Suggested by: Janey Hoe <janey@lcs.mit.edu>


# 6bb9a8e7 04-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

Make a whole bunch of PCB variables ints rather than shorts. There appear
to be no ill effects, and so far as Iknow none of the variables in
question depend on 16-bit wraparound behavior. (The sizes are in
many cases relics from when a PCB had to fit inside a 128-byte mbuf. PCBs
are no longer stored in that way, and the old structure would not have
fit, either.)


# 4dc45a5f 22-Sep-1995 Peter Wemm <peter@FreeBSD.org>

Remove duplicate definition for tcps_persistdrop, as added by davidg some
time ago. I left in Garrett's one, because his was in the 4.4-Lite-2
location, making any diffs just that little bit smaller.

I presume this choice means that netstat needs to be recompiled before
"netstat -s" will give a meaningful answer on tcp stats.


# 6c52bc46 21-Sep-1995 Garrett Wollman <wollman@FreeBSD.org>

Merge with 4.4-Lite-2. This just adds a couple of tcpstat entries which
we don't currently set, but might in the future.


# efe4b0eb 21-Sep-1995 Garrett Wollman <wollman@FreeBSD.org>

Second try: get 4.4-Lite-2 into the source tree. The conflicts don't
matter because none of our working source files are on the CSRG branch
any more.

Obtained from: 4.4BSD-Lite-2


# cc0964fb 29-Jul-1995 David Greenman <dg@FreeBSD.org>

Add connection drop capability for persist timeouts.

Reviewed by: Andras Olah
Obtained from: 4.4BSD-lite2 via W. Richard Stevens


# dd224982 10-Jul-1995 Garrett Wollman <wollman@FreeBSD.org>

tcp_input.c - keep track of how many times a route contained a cached rtt
or ssthresh that we were able to use

tcp_var.h - declare tcpstat entries for above; declare tcp_{send,recv}space

in_rmx.c - fill in the MTU and pipe sizes with the defaults TCP would have
used anyway in the absence of values here


# fc978271 29-Jun-1995 Garrett Wollman <wollman@FreeBSD.org>

Keep track of the number of samples through the srtt filter so that we
know better when to cache values in the route, rather than relying on a
heuristic involving sequence numbers that broke when tcp_sendspace
was increased to 16k.


# 91677201 19-Jun-1995 Garrett Wollman <wollman@FreeBSD.org>

Now that we've gone to all sorts of effort to allow TCP to cache some of
its connection parameters, we want to keep statistics on how often this
actually happens to see whether there is any work that needs to be done in
TCP itself.

Suggested by: John Wroclawski <jtw@lcs.mit.edu>


# 15bd2b43 08-Apr-1995 David Greenman <dg@FreeBSD.org>

Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash,
and in_pcblookuphash.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 41f82abe 15-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Transaction TCP support now standard. Hack away!


# f2ea20e6 15-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Add lots of useful MIB variables and a few not-so-useful ones for
completeness.


# 2f96f1f4 13-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Get rid of some unneeded #ifdef TTCP lines. Also, get rid of some
bogus commons declared in header files.


# a0292f23 09-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Merge Transaction TCP, courtesy of Andras Olah <olah@cs.utwente.nl> and
Bob Braden <braden@isi.edu>.

NB: This has not had David's TCP ACK hack re-integrated. It is not clear
what the correct solution to this problem is, if any. If a better solution
doesn't pop up in response to this message, I'll put David's code back in
(or he's welcome to do so himself).


# 512ff5ea 08-Feb-1995 David Greenman <dg@FreeBSD.org>

Fix/#ifdef prototype for tcp_mss...apparantly overlooked by Garrett.


# eb6ad696 08-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Merge in T/TCP TCP header file changes.


# 707f139e 20-Aug-1994 Paul Richards <paul@FreeBSD.org>

Made idempotent.

Submitted by: Paul


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources