History log of /freebsd-current/sys/netinet/tcp_subr.c
Revision Date Author Comments
# ea916b64 18-May-2024 Randall Stewart <rrs@FreeBSD.org>

Remove TCP_SAD optional code now that the sack filter performs this function.

With the commit of D44903 we no longer need the SAD option. Instead all stacks that
use the sack filter inherit its protection against sack-attack.

Reviewed by: tuexen@
Differential Revision:https://reviews.freebsd.org/D45216


# 59884aea 04-May-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: clean up macro useage in tcp_fixed_maxseg()

Replace local PAD macro with PADTCPOLEN macro
No functional change.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D45076


# fce03f85 05-May-2024 Randall Stewart <rrs@FreeBSD.org>

TCP can be subject to Sack Attacks lets fix this issue.

There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like:

ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704)
This first sack looks fine but then the attacker sends

ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705)
ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707)
...
These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries.

To combat this we introduce three things.

when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing.
Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks.
The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway.
The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks).

After this set of changes is in rack can drop its SAD detection completely

Reviewed by:tuexen@, rscheff@
Differential Revision: <https://reviews.freebsd.org/D44903>


# 6b454da6 03-Apr-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: address a warning

t_state is an unsigned variable, so no need for testing that it is
non-negative.

Reported by: Coverity Scan
CID: 1390885
Reviewed by: glebius
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44619


# e0bd1801 03-Apr-2024 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix conversion of rttvar

A wrong variable and wrong scaling factors were used.

Reported by: Coverity Scan
CID: 1508689
Reviewed by: rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44612


# 1a8d1764 29-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: fully retire inp_ppcb pointer

Before a protocol specific control block started to embed inpcb in self
(see 0aa120d52f3c, e68b3792440c, 483fe96511ec) this pointer used to point
at it.

Retain kf_sock_inpcb field in the struct kinfo_file in <sys/user.h>. The
exp-run detected a minimal use of the field in ports:
* sysutils/lsof - patched upstream
* net-mgmt/netdata - patch accepted upstream
* emulators/qemu-user-static - upstream master branch seems not using
the field anymore
We can keep the field around for some time, but eventually it may be
reused for something else.

PR: 277659 (exp-run)
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D44491


# e34ea019 18-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: clear all TCP timers in tcp_timer_stop() when in callout

When a TCP callout decides to disable self, e.g. tcp_timer_2msl() calling
tcp_close(), we must also clear all other possible timers. Otherwise,
upon return, the callout would be scheduled again in tcp_timer_enter().

Revert 57e27ff07aff, which was a temporary partial revert of otherwise
correct 62d47d73b7eb, that exposed the problem being fixed now. Add an
extra assertion in tcp_timer_enter() to check we aren't arming callout for
a closed connection.

Reviewed by: rscheff


# dd7b86e2 18-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove IS_FASTOPEN() macro

The macro is more obfuscating than helping as it just checks a single flag
of t_flags. All other t_flags bits are checked without a macro.

A bigger problem was that declaration of the macro in tcp_var.h depended
on a kernel option. It is a bad practice to create such definitions in
installable headers.

Reviewed by: rscheff, tuexen, kib
Differential Revision: https://reviews.freebsd.org/D44362


# e18b97bd 12-Mar-2024 Randall Stewart <rrs@FreeBSD.org>

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

With a bit of a struggle with git I finally got rack_pcm.c into place (apologies
for not noticing this error). The LINT kernel is running on my box now .. sigh.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986


# c112243f 11-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

Revert "Update to bring the rack stack with all its fixes in."

This commit was incomplete and breaks LINT kernels. The tree has been
broken for 8+ hours.

This reverts commit f6d489f402c320f1a6eaa473491a0b8c3878113e.


# f6d489f4 11-Mar-2024 Randall Stewart <rrs@FreeBSD.org>

Update to bring the rack stack with all its fixes in.

This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

Note there is a new file that I can't figure out how to get in rack_pcm.c

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986


# 57e27ff0 12-Feb-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: partially undo D43792

At the destruction of the tcpcb, no timers are supposed to
be running. However, it turns out that stopping them in the
close() / shutdown() call does not have the desired effect
under all circumstances.

This partially reverts 62d47d73b7eb to reduce the nuisance
caused.

PR: 277009
Reported-by: syzbot+9a9aa434a14a2b35c3ba@syzkaller.appspotmail.com
Reported-by: syzbot+e82856782410e895bae7@syzkaller.appspotmail.com
Reviewed By: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43855


# 62d47d73 10-Feb-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: stop timers and clean scoreboard in tcp_close()

Stop timers when in tcp_close() instead of doing that in tcp_discardcb().
A connection in CLOSED state shall not need any timers. Assert that no
timer is rescheduled after that in tcp_timer_activate() and verfiy that
this is also the expected state in tcp_discardcb().

PR: 276761
Reviewed By: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43792


# 3eeb22cb 10-Feb-2024 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: clean scoreboard when releasing the socket buffer

The SACK scoreboard is conceptually an extention of the socket
buffer. Remove it when the socket buffer goes away with
soisdisconnected(). Verify that this is also the expected
state in tcp_discardcb().

PR: 276761
Reviewed by: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43805


# 3f46be6a 07-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: let tcp_hpts_init() set a random CPU only once

After d2ef52ef3dee the tcp_hpts_init() function can be called multiple
times on a tcpcb if it is switched there and back between two TCP stacks.
First, this makes existing assertion in tcp_hpts_init() incorrect. Second,
it creates possibility to change a randomly set t_hpts_cpu to a different
random value, while a tcpcb is already in the HPTS wheel, triggering other
assertions later in tcp_hptsi().

The best approach here would be to work on the stacks to really clear a
tcpcb out of HPTS wheel in tfb_tcp_fb_fini, draining the IHPTS_MOVING
state. But that's pretty intrusive change, so let's just get back to the
old logic (pre d2ef52ef3dee) where t_hpts_cpu was set to a random value
only once in a CPU lifetime and a newly switched stack inherits t_hpts_cpu
from the previous stack.

Reviewed by: rrs, tuexen
Differential Revision: https://reviews.freebsd.org/D42946
Reported-by: syzbot+fab29fe1ab089c52998d@syzkaller.appspotmail.com
Reported-by: syzbot+ca5f2aa0fda15dcfe6d7@syzkaller.appspotmail.com
Fixes: 2b3a77467dd3d74a7170f279fb25f9736b46ef8a


# ade05d63 07-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: stop stack timers in tcp_switch_back_to_default()

This funcion is an alternative code path that detaches an alternative
TCP stack, missed in d2ef52ef3dee38cccb7f54d33ecc2a4b944dad9d.

Reviewed by: rrs, tuexen
Differential Revision: https://reviews.freebsd.org/D42917
Reported-by: syzbot+186130be9f0ca5557d4e@syzkaller.appspotmail.com
Fixes: d2ef52ef3dee38cccb7f54d33ecc2a4b944dad9d


# d2ef52ef 04-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp/hpts: make stacks responsible for clearing themselves out HPTS

There already is the tfb_tcp_timer_stop_all method that is supposed to stop
all time events associated with a given tcpcb by given stack. Some time
ago it was doing actual callout_stop(). Today bbr/rack just mark their
internal state as inactive in their tfb_tcp_timer_stop_all methods, but
tcpcb stays in HPTS wheel and potentially called in from HPTS. Change the
methods to also call tcp_hpts_remove(). Note: I'm not sure if internal
flag is still relevant once we are out of HPTS wheel.

Call the method when connection goes into TCP_CLOSED state, instead of
calling it later when tcpcb is freed. Also call it when we switch between
stacks.

Reviewed by: tuexen, rrs
Differential Revision: https://reviews.freebsd.org/D42857


# 2b3a7746 04-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

hpts: make stacks responsible for tcp_hpts_init()

Those stacks that use HPTS should care about init, not generic code.

Reviewed by: imp, tuexen, rrs
Differential Revision: https://reviews.freebsd.org/D42856


# 8e907391 04-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

hpts: don't ifdef tcp_in_hpts()

This small inline function is always available.

Reviewed by: imp, tuexen, rrs
Differential Revision: https://reviews.freebsd.org/D42855


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 219a6ca9 21-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: uninline tcp_account_for_send()

This allows to clear inclusion of "opt_kern_tls.h" from a system header.

Reviewed by: rscheff, tuexen
Differential Revision: https://reviews.freebsd.org/D42696


# 38ecc80b 08-Oct-2023 Zhenlei Huang <zlei@FreeBSD.org>

tcp: Simplify the initialization of loader tunable 'net.inet.tcp.tcbhashsize'

No functional change intended.

Reviewed by: cc, rscheff, #transport
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D41998


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# d66540e8 05-Jun-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve sending of TTL/hoplimit and DSCP

Ensure that a user specified value of TTL/hoplimit and DSCP is
used when sending packets.

Reviewed by: cc, rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D40423


# 4f2cc73f 31-May-2023 Jonathan T. Looney <jtl@FreeBSD.org>

tcp: Refactor tcp_get_srtt()

Refactor tcp_get_srtt() into its two component operations: unit
conversion and shifting. No functional change is intended.

Reviewed by: cc, tuexen
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D40304


# f5abdb03 25-May-2023 Cheng Cui <cc@FreeBSD.org>

tcp: fix a bug where unshifting should be put last in tcp_get_srtt()

Reported by: Bhaskar Pardeshi from VMware.
Reviewers: rscheff, tuexen, #transport!
Approved by: tuexen (mentor)
Subscribers: imp, melifaro, glebius
Differential Revision: https://reviews.freebsd.org/D40267


# 57a3a161 24-May-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: request tracking is not http specific.

This change is a name change only. TCP Request tracking can track sendfile and even non-sendfile requests. The
names however in the current code use http, and they should not. The feature is not http specific. Lets change the
name so they more properly reflect whats going on. This also fixes conflicts with http_req which caused application pain.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D40229


# 72ae9382 19-May-2023 Randall Stewart <rrs@FreeBSD.org>

Add a comment to the new tcp_get_srtt method to clarify that ticks
are kept in a shifted form and need to be un-shifted before use.

Suggested by: rpokala@


# ec6d620b 19-May-2023 Randall Stewart <rrs@FreeBSD.org>

There are congestion control algorithms will that pull in srtt, and this can cause issues with rack.

When using rack, cubic and htcp will grab the srtt, but they think it is in ticks. For rack
it is in micro-seconds (which we should probably move all stacks to actually). This causes
issues so instead lets make a new interface so that any CC module can pull the srtt in
whatever granularity they want.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D40146


# c3c20de3 25-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: move HPTS/LRO flags out of inpcb to tcpcb

These flags are TCP specific. While here, make also several LRO
internal functions to pass tcpcb pointer instead of inpcb one.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39698


# c2a69e84 25-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: move HPTS related fields from inpcb to tcpcb

This makes inpcb lighter and allows future cache line optimizations
of tcpcb. The reason why HPTS originally used inpcb is the compressed
TIME-WAIT state (see 0d7445193ab), that used to free a tcpcb, while the
associated connection is still on the HPTS ring.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39697


# 144259f6 25-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: purge the input queue from tcp_discardcb()

The purge was intentionally removed in a540cdca3183. My assumption
was that the stacks that use the input queue always call the
tcp_handle_orphaned_packets() in their tfb_tcp_fb_fini method.
However, rack will skip doing that if t_fb_ptr is NULL and there are
scenarios when it is NULL, e.g. close(2) on a socket (but some
special close(2)). Instead of working out all possible scenarios
let's put this safebelt back.

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39696


# 8e813d07 20-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

netstat: fix printing of TCP pcbs with -A

This change touches both kernel and netstat(1), but either of the changes
will fix printing pcb addresses with -A.

The thing is that historically netstat(1) treated TCP differently, and
printed tcpcb address instead of inpcb address. This is not documented
anywhere! With e68b3792440 these two addresses became the same. It is
highly likely they will be the same for a long time, but it might be they
will start to differ again in a far future. My proposal is to stop
treating TCP differently with netstat(1) and right now is a good opportunity
to do that, since there will be no behavior change at all. The kernel
change to tcp_inptoxtp() will go into stable/14 to make it compatible with
netstat(1) binary from stable/13. We can drop it later, probably together
with in_ppcb pointer from inpcb. The in_ppcb in xinpcb will stay for size
compatibility.

Reviewed by: tuexen, rrs
Differential Revision: https://reviews.freebsd.org/D39736


# 303246dc 18-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

We have a TCP_LOG_CONNEND log that should come out at the very last log of every connection. This
holds some nice stats about why/how the connection ended. Though with the current code it does not
come out without accounting due to the placement of the ifdefs. Also we need to make sure the stacks
fini has ran before calling in from tcp_subr so we get all logs the stack may make at its ending.

Reviewed by: rscheff
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39693


# 3232b1f4 17-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: fix build

The recent 25685b75375 came in conflict with a540cdca318. Remove the
code that cleans up the old style input queue. Note that two lines
below we assert that the new style input queue is empty. The TCP
stacks that use the queue are supposed to flush it in their
tfb_tcp_fb_fini method.


# a540cdca 17-Apr-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: use queue(9) STAILQ for the input queue

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39574


# 25685b75 13-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

TCP: Misc cleanups of tcp_subr.c

In going through all the TCP stacks I have found we have a few little bugs and niggles that need to
be cleaned up in tcp_subr.c including the following:

a) Set tcp_restoral_thresh to 450 (45%) not 550. This is a better proven value in my testing.
b) Lets track when we try to do pacing but fail via a counter for connections that do pace.
c) If a switch away from the default stack occurs and it fails we need to make sure the time
scale is in the right mode (just in case the other stack changed it but then failed).
d) Use the TP_RXTCUR() macro when starting the TT_REXMT timer.
e) When we end a default flow lets log that in BBlogs as well as cleanup any t_acktime (disable).
f) When we respond with a RST lets make sure to update the log_end_status properly.
g) When starting a new pcb lets assure that all LRO features are off.
h) When discarding a connection lets make sure that any t_in_pkt's that might be there are freed properly.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39501


# 2ba2849c 12-Apr-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix typo in comment

Reported by: cc
MFC after: 1 week
Sponsored by: Netflix, Inc.


# c687f21a 12-Apr-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: make net.inet.tcp.functions_default vnet specific
Reviewed by: cc, rrs
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D39516


# 73c48d9d 12-Apr-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: fix deregistering stacks when vnets are used

This fixes a bug where stacks could not be deregistered when
end points in the non-default vnet are using it.

Reviewed by: glebius, zlei
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D39514


# 2169f712 11-Apr-2023 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: use IPV6_FLOWLABEL_LEN

Avoid magic numbers when handling the IPv6 flow ID for
DSCP and ECN fields and use the named variable instead.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D39503


# 945f9a7c 07-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

tcp: misc cleanup of options for rack as well as socket option logging.

Both BBR and Rack have the ability to log socket options, which is currently disabled. Rack
has an experimental SaD (Sack Attack Detection) algorithm that should be made available. Also
there is a t_maxpeak_rate that needs to be removed (its un-used).

Reviewed by: tuexen, cc
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39427


# 030434ac 04-Apr-2023 Randall Stewart <rrs@FreeBSD.org>

Update rack to the latest code used at NF.

There have been many changes to rack over the last couple of years, including:
a) Ability when switching stacks to have one stack query another.
b) Internal use of micro-second timers instead of ticks.
c) Many changes to pacing in forms of
1) Improvements to Dynamic Goodput Pacing (DGP)
2) Improvements to fixed rate paciing
3) A new feature called hybrid pacing where the requestor can
get a combination of DGP and fixed rate pacing with deadlines
for delivery that can dynamically speed things up.
d) All kinds of bugs found during extensive testing and use of the
rack stack for streaming video and in fact all data transferred
by NF

Reviewed by: glebius, gallatin, tuexen
Sponsored By: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D39402


# 73ee5756 31-Mar-2023 Randall Stewart <rrs@FreeBSD.org>

Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features.

So stack switching as always been a bit of a issue. We currently use a break before make setup which means that
if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so
that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from
it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).

Once all this lands I will then update rack to begin using all these new features.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210


# 69c7c811 16-Mar-2023 Randall Stewart <rrs@FreeBSD.org>

Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities.

The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints
we need to move to using the new inline functions. This adds them and moves rack to now use
the tcp_tracepoints.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D38831


# 3d0d5b21 23-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Explicitly include <net/if_private.h> in netstack

Summary:
In preparation of making if_t completely opaque outside of the netstack,
explicitly include the header. <net/if_var.h> will stop including the
header in the future.

Sponsored by: Juniper Networks, Inc.
Reviewed by: glebius, melifaro
Differential Revision: https://reviews.freebsd.org/D38200


# e2d14a04 26-Jan-2023 Michael Tuexen <tuexen@FreeBSD.org>

tcp: improve error handling of net.inet.tcp.udp_tunneling_port

In case the new port can't be set, set the port to 0.

MFC after: 3 days
Sponsored by: Netflix, Inc.


# d3acb974 26-Jan-2023 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: protect TCP over UDP configuration with a lock

The sysctl modifies global sockets without any locks. The removed
comment suggests that previously it relied on a lock that doesn't
exist today.


# eaabc937 14-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: retire TCPDEBUG

This subsystem is superseded by modern debugging facilities,
e.g. DTrace probes and TCP black box logging.

We intentionally leave SO_DEBUG in place, as many utilities may
set it on a socket. Also the tcp::debug DTrace probes look at
this flag on a socket.

Reviewed by: gnn, tuexen
Discussed with: rscheff, rrs, jtl
Differential revision: https://reviews.freebsd.org/D37694


# 446ccdd0 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: use single locked callout per tcpcb for the TCP timers

Use only one callout structure per tcpcb that is responsible for handling
all five TCP timeouts. Use locked version of callout, of course. The
callout function tcp_timer_enter() chooses soonest timer and executes it
with lock held. Unless the timer reports that the tcpcb has been freed,
the callout is rescheduled for next soonest timer, if there is any.

With single callout per tcpcb on connection teardown we should be able
to fully stop the callout and immediately free it, avoiding use of
callout_async_drain(). There is one gotcha here: callout_stop() can
actually touch our memory when a rare race condition happens. See
comment above tcp_timer_stop(). Synchronous stop of the callout makes
tcp_discardcb() the single entry point for tcpcb destructor, merging the
tcp_freecb() to the end of the function.

While here, also remove lots of lingering checks in the beginning of
TCP timer functions. With a locked callout they are unnecessary.

While here, clean unused parts of timer KPI for the pluggable TCP stacks.

While here, remove TCPDEBUG from tcp_timer.c, as this allows for more
simplification of TCP timers. The TCPDEBUG is scheduled for removal.

Move the DTrace probes in timers to the beginning of a function, where
a tcpcb is always existing.

Discussed with: rrs, tuexen, rscheff (the TCP part of the diff)
Reviewed by: hselasky, kib, mav (the callout part)
Differential revision: https://reviews.freebsd.org/D37321


# e68b3792 07-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: embed inpcb into tcpcb

For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.

The most import one is the inpcb itself. With embedding we can provide
strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa. Also we reduce number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such move.

The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and temprorary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.

One interesting side effect is that now TCP data is allocated from
SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.

Large part of the change was done with sed script:

s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g

Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, that added several <netinet/in_pcb.h>.

Differential revision: https://reviews.freebsd.org/D37127


# 0aa120d5 02-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: allow to provide protocol specific pcb size

The protocol specific structure shall start with inpcb.

Differential revision: https://reviews.freebsd.org/D37126


# 8840ae22 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: don't store VNET in every tcpcb, take it from the inpcbinfo

Reviewed by: rscheff
Differential revision: https://reviews.freebsd.org/D37125


# ab0ef945 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

hpts: move inp initialization from the generic inpcb code to TCP

Differential revision: https://reviews.freebsd.org/D37124


# 9eb0e832 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: provide macros to access inpcb and socket from a tcpcb

There should be no functional changes with this commit.

Reviewed by: rscheff
Differential revision: https://reviews.freebsd.org/D37123


# f71cb9f7 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: inp_socket is valid through the lifetime of a TCP inpcb

The inp_socket is cleared only in in_pcbdetach(), which for TCP is
always accompanied with inp_pcbfree(). An inpcb that went through
in_pcbfree() shall never be returned by any kind of pcb lookup.

Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D37062


# ada90cb9 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove INP_DROPPED check from notify functions

These functions tcp_notify(), tcp_drop_syn_sent() and tcp_mtudisc()
are called from tcp*_ctlinput*() right after successfull
in_pcblookup*(). They shall never get a pcb that is dropped.


# f567d55f 08-Nov-2022 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: don't return INP_DROPPED entries from pcb lookups

The in_pcbdrop() KPI, which is used solely by TCP, allows to remove a
pcb from hash list and mark it as dropped. The comment suggests that
such pcb won't be returned by lookups. Indeed, every call to
in_pcblookup*() is accompanied by a check for INP_DROPPED. Do what
comment suggests: never return such pcbs and remove unnecessary checks.

Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D37061


# 83c1ec92 20-Oct-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: ECN preparations for ECN++, AccECN (tcp_respond)

tcp_respond is another function to build a tcp control packet
quickly. With ECN++ and AccECN, both the IP ECN header, and
the TCP ECN flags are supposed to reflect the correct state.

Also ensure that on receiving multiple ECN SYN-ACKs, the
responses triggered will reflect the latest state.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36973


# 53af6903 06-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove INP_TIMEWAIT flag

Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193ab, this commit shall not cause any functional changes.

Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.

Differential revision: https://reviews.freebsd.org/D36400


# 0d744519 06-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove tcptw, the compressed timewait state structure

The memory savings the tcptw brought back in 2003 (see 340c35de6a2) no
longer justify the complexity required to maintain it. For longer
explanation please check out the email [1].

Surpisingly through almost 20 years the TCP stack functionality of
handling the TIME_WAIT state with a normal tcpcb did not bitrot. The
existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state,
which is confirmed by the packetdrill tcp-testsuite [2].

This change just removes tcptw and leaves INP_TIMEWAIT. The flag will
be removed in a separate commit. This makes it easier to review and
possibly debug the changes.

[1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html
[2] https://github.com/freebsd-net/tcp-testsuite

Differential revision: https://reviews.freebsd.org/D36398


# c2a808b9 04-Oct-2022 Hans Petter Selasky <hselasky@FreeBSD.org>

Fix kernel build after fcb3f813f379f544f9cd2a10d18045588da0e132 .

By adding missing ifdefs for INET6 .

Differential Revision: https://reviews.freebsd.org/D36731
Sponsored by: NVIDIA Networking


# 775e20c1 03-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: make tcp_drop_syn_sent() static


# fcb3f813 03-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netinet*: remove PRC_ constants and streamline ICMP processing

In the original design of the network stack from the protocol control
input method pr_ctlinput was used notify the protocols about two very
different kinds of events: internal system events and receival of an
ICMP messages from outside. These events were coded with PRC_ codes.
Today these methods are removed from the protosw(9) and are isolated
to IPv4 and IPv6 stacks and are called only from icmp*_input(). The
PRC_ codes now just create a shim layer between ICMP codes and errors
or actions taken by protocols.

- Change ipproto_ctlinput_t to pass just pointer to ICMP header. This
allows protocols to not deduct it from the internal IP header.
- Change ip6proto_ctlinput_t to pass just struct ip6ctlparam pointer.
It has all the information needed to the protocols. In the structure,
change ip6c_finaldst fields to sockaddr_in6. The reason is that
icmp6_input() already has this address wrapped in sockaddr, and the
protocols want this address as sockaddr.
- For UDP tunneling control input, as well as for IPSEC control input,
change the prototypes to accept a transparent union of either ICMP
header pointer or struct ip6ctlparam pointer.
- In icmp_input() and icmp6_input() do only validation of ICMP header and
count bad packets. The translation of ICMP codes to errors/actions is
done by protocols.
- Provide icmp_errmap() and icmp6_errmap() as substitute to inetctlerrmap,
inet6ctlerrmap arrays.
- In protocol ctlinput methods either trust what icmp_errmap() recommend,
or do our own logic based on the ICMP header.

Differential revision: https://reviews.freebsd.org/D36731


# c0fc81e9 03-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netinet*: remove dead code from TCP, UDP, SCTP control input

Now these functions are called only from icmp*_input(). The pointer
to the ICMP data is never NULL and cmd has a limited set of values.

In the past the functions were demultiplexing control messages from
ICMP layer, as well as internally generated events. In the latter
case the the pointer to IP would be NULL.

Differential revision: https://reviews.freebsd.org/D36729


# 7f3b00a8 03-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netinet: filter out invalid ICMP responses in ip_icmp()

instead of doing that in every ipproto_ctlinput_t method.

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36728


# 43d39ca7 03-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netinet*: de-void control input IP protocol methods

After decoupling of protosw(9) and IP wire protocols in 78b1fc05b205 for
IPv4 we got vector ip_ctlprotox[] that is executed only and only from
icmp_input() and respectively for IPv6 we got ip6_ctlprotox[] executed
only and only from icmp6_input(). This allows to use protocol specific
argument types in these methods instead of struct sockaddr and void.

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36727


# 46ddeb6b 03-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netinet6: retire ip6protosw.h

The netinet/ipprotosw.h and netinet6/ip6protosw.h were KAME relics, with
the former removed in f0ffb944d25 in 2001 and the latter survived until
today. It has been reduced down to only one useful declaration that
moves to ip6_var.h

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36726


# 0924ae8f 03-Oct-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: allow window scale and timestamps to be toggled individually

Simple change to allow for the individual toggling of
RFC7323 window scaling and timestamp option.

Reviewed By: rrs, tuexen, glebius, guest-ccui, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36863


# 9453ec66 21-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: increment tcpstats in tcp_respond()

tcp_respond() crafts a packet and sends it directly to ip[6]output(),
bypassing tcp_output(). Hence it must increment TCP send statistics.

Reviewed by: rscheff, tuexen, rrs (implicitly)
Differential revision: https://reviews.freebsd.org/D36641


# c414347b 29-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

mbufs: isolate max_linkhdr and max_protohdr handling in the mbuf code

o Statically initialize max_linkhdr to default value without relying
on domain(9) code doing that.
o Statically initialize max_protohdr to a sane value, without relying
on TCP being always compiled in.
o Retire max_datalen. Set, but not used.
o Don't make the domain(9) system responsible in validating these
values and updating max_hdr. Instead provide KPI max_linkhdr_grow()
and max_protohdr_grow().
o Call max_linkhdr_grow() from IEEE802.11 and max_protohdr_grow() from
TCP. Those are the only protocols today that may want to grow.

Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D36376


# e7d02be1 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: refactor protosw and domain static declaration and load

o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d18ef), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860dcd).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee54ea).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232


# a6b982e2 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: move tcp_drain() verbatim before tcp_init()


# 81a34d37 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: retire pr_drain and use EVENTHANDLER(9) directly

The method was called for two different conditions: 1) the VM layer is
low on pages or 2) one of UMA zones of mbuf allocator exhausted.
This change 2) into a new event handler, but all affected network
subsystems modified to subscribe to both, so this change shall not
bring functional changes under different low memory situations.

There were three subsystems still using pr_drain: TCP, SCTP and frag6.
The latter had its protosw entry for the only reason to register its
pr_drain method.

Reviewed by: tuexen, melifaro
Differential revision: https://reviews.freebsd.org/D36164


# 78b1fc05 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: separate pr_input and pr_ctlinput out of protosw

The protosw KPI historically has implemented two quite orthogonal
things: protocols that implement a certain kind of socket, and
protocols that are IPv4/IPv6 protocol. These two things do not
make one-to-one correspondence. The pr_input and pr_ctlinput methods
were utilized only in IP protocols. This strange duality required
IP protocols that doesn't have a socket to declare protosw, e.g.
carp(4). On the other hand developers of socket protocols thought
that they need to define pr_input/pr_ctlinput always, which lead to
strange dead code, e.g. div_input() or sdp_ctlinput().

With this change pr_input and pr_ctlinput as part of protosw disappear
and IPv4/IPv6 get their private single level protocol switch table
ip_protox[] and ip6_protox[] respectively, pointing at array of
ipproto_input_t functions. The pr_ctlinput that was used for
control input coming from the network (ICMP, ICMPv6) is now represented
by ip_ctlprotox[] and ip6_ctlprotox[].

ipproto_register() becomes the only official way to register in the
table. Those protocols that were always static and unlikely anybody
is interested in making them loadable, are now registered by ip_init(),
ip6_init(). An IP protocol that considers itself unloadable shall
register itself within its own private SYSINIT().

Reviewed by: tuexen, melifaro
Differential revision: https://reviews.freebsd.org/D36157


# d8596171 04-Jul-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: use only soref()/sorele() as socket reference count

o Retire SS_FDREF as it is basically a debug flag on top of already
existing soref()/sorele().
o Convert SS_PROTOREF into soref()/sorele().
o Change reference model for the listen queues, see below.
o Make sofree() private. The correct KPI to use is only sorele().
o Make soabort() respect the model and sorele() instead of sofree().

Note on listening queues. Until now the sockets on a queue had zero
reference count. And the reference were given only upon accept(2). The
assumption was that there is no way to see the queued socket from anywhere
except its head. This is not true, since queued sockets already have pcbs,
which are linked at least into the global pcb lists. With this change we
put the reference right in the sonewconn() and on accept(2) path we just
hand the existing reference to the file descriptor.

Differential revision: https://reviews.freebsd.org/D35679


# f328c46f 13-Apr-2022 John Baldwin <jhb@FreeBSD.org>

TCP sysctl handlers: fin and lin are only used for INET.


# 700a395c 13-Apr-2022 John Baldwin <jhb@FreeBSD.org>

tcp_log_vain/addrs: Use a const pointer for the IPv4 header.

The pointer to the IPv6 header was already const.


# 13ec6858 13-Apr-2022 John Baldwin <jhb@FreeBSD.org>

tcp_log_addr: ip is only used for INET.


# 742e7210 11-Apr-2022 Kristof Provost <kp@FreeBSD.org>

udp: allow udp_tun_func_t() to indicate it did not eat the packet

Allow udp tunnel functions to indicate they have not taken ownership of
the packet, and that normal UDP processing should continue.

This is especially useful for scenarios where the kernel has taken
ownership of a socket that was originally created by userspace. It
allows the tunnel function to pass through certain packets for userspace
processing.

The primary user of this is if_ovpn, when it receives messages from
unknown peers (which might be a new client).

Reviewed by: tuexen
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34883


# ea9017fb 21-Feb-2022 Randall Stewart <rrs@FreeBSD.org>

tcp: Congestion control move to using reference counting.

In the transport call on 12/3 Gleb asked to move the CC modules towards
using reference counting to prevent folks from unloading a module in use.
It was also agreed that Michael would do a user space utility like tcp_drop
that could be used to move all connections that are using a specific CC
to some other CC.

This is the half I committed to doing, making it so that we maintain a refcount
on a cc module every time a pcb refers to it and decrementing that every
time a pcb no longer uses a cc module. This also helps us simplify the
whole unloading process by getting rid of tcp_ccunload() which munged
through all the tcb's. Instead we mark a module as being removed and
prevent further references to it. We also make sure that if a module is
marked as being removed it cannot be made as the default and also
the opposite of that, if its a default it fails and does not mark it as being
removed.

Reviewed by: Michael Tuexen, Gleb Smirnoff
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D33249


# a35bdd44 08-Feb-2022 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add sysctl interface for setting socket options

This interface allows to set a socket option on a TCP endpoint,
which is specified by its inp_gencnt. This interface will be
used in an upcoming command line tool tcpsso.

Reviewed by: glebius, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D34138


# 1ebf4607 03-Feb-2022 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: Access all 12 TCP header flags via inline function

In order to consistently provide access to all
(including reserved) TCP header flag bits,
use an accessor function tcp_get_flags and
tcp_set_flags. Also expand any flag variable from
uint8_t / char to uint16_t.

Reviewed By: hselasky, tuexen, glebius, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34130


# fec8a8c7 03-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: use global UMA zones for protocols

Provide structure inpcbstorage, that holds zones and lock names for
a protocol. Initialize it with global protocol init using macro
INPCBSTORAGE_DEFINE(). Then, at VNET protocol init supply it as
the main argument to the in_pcbinfo_init(). Each VNET pcbinfo uses
its private hash, but they all use same zone to allocate and SMR
section to synchronize.

Note: there is kern.ipc.maxsockets sysctl, which controls UMA limit
on the socket zone, which was always global. Historically same
maxsockets value is applied also to every PCB zone. Important fact:
you can't create a pcb without a socket! A pcb may outlive its socket,
however. Given that there are multiple protocols, and only one socket
zone, the per pcb zone limits seem to have little value. Under very
special conditions it may trigger a little bit earlier than socket zone
limit, but in most setups the socket zone limit will be triggered
earlier. When VIMAGE was added to the kernel PCB zones became per-VNET.
This magnified existing disbalance further: now we have multiple pcb
zones in multiple vnets limited to maxsockets, but every pcb requires a
socket allocated from the global zone also limited by maxsockets.
IMHO, this per pcb zone limit doesn't bring any value, so this patch
drops it. If anybody explains value of this limit, it can be restored
very easy - just 2 lines change to in_pcbstorage_init().

Differential revision: https://reviews.freebsd.org/D33542


# 89128ff3 03-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protocols: init with standard SYSINIT(9) or VNET_SYSINIT

The historical BSD network stack loop that rolls over domains and
over protocols has no advantages over more modern SYSINIT(9).
While doing the sweep, split global and per-VNET initializers.

Getting rid of pr_init allows to achieve several things:
o Get rid of ifdef's that protect against double foo_init() when
both INET and INET6 are compiled in.
o Isolate initializers statically to the module they init.
o Makes code easier to understand and maintain.

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D33537


# a370832b 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove delayed drop KPI

No longer needed after tcp_output() can ask caller to drop.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33371


# f64dc2ab 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: TCP output method can request tcp_drop

The advanced TCP stacks (bbr, rack) may decide to drop a TCP connection
when they do output on it. The default stack never does this, thus
existing framework expects tcp_output() always to return locked and
valid tcpcb.

Provide KPI extension to satisfy demands of advanced stacks. If the
output method returns negative error code, it means that caller must
call tcp_drop().

In tcp_var() provide three inline methods to call tcp_output():
- tcp_output() is a drop-in replacement for the default stack, so that
default stack can continue using it internally without modifications.
For advanced stacks it would perform tcp_drop() and unlock and report
that with negative error code.
- tcp_output_unlock() handles the negative code and always converts
it to positive and always unlocks.
- tcp_output_nodrop() just calls the method and leaves the responsibility
to drop on the caller.

Sweep over the advanced stacks and use new KPI instead of using HPTS
delayed drop queue for that.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33370


# 40fa3e40 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: mechanically substitute call to tfb_tcp_output to new method.

Made with sed(1) execution:

sed -Ef sed -i "" $(grep --exclude tcp_var.h -lr tcp_output sys/)

sed:
s/tp->t_fb->tfb_tcp_output\(tp\)/tcp_output(tp)/
s/to tfb_tcp_output\(\)/to tcp_output()/

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33366


# 5b08b46a 26-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: welcome back tcp_output() as the right way to run output on tcpcb.

Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33365


# c2c8e360 04-Dec-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

tcp: virtualise net.inet.tcp.msl sysctl.

VNET teardown waits 2*MSL (60 seconds by default) before expiring
tcp PCBs. These PCBs holds references to nexthops, which, in turn,
reference ifnets. This chain results in VNET interfaces being destroyed
and moved to default VNET only after 60 seconds.
Allow tcp_msl to be set in jail by virtualising net.inet.tcp.msl sysctl,
permitting more predictable VNET tests outcomes.

MFC after: 1 week
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D33270


# 75add59a 17-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: allocate statistics in the main tcp_init()

No reason to have a separate SYSINIT.


# 36f42c5e 03-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_ccalgounload(): initialize the inpcb iterator when curvnet is set

Pointy hat to: glebius
Fixes: de2d47842e88


# 12ae3476 02-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_drain(): initialize the inpcb iterator when curvnet is set

Reported by: cy
Pointy hat to: glebius
Fixes: de2d47842e88


# db0ac6de 02-Dec-2021 Cy Schubert <cy@FreeBSD.org>

Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816"

This reverts commit 266f97b5e9a7958e365e78288616a459b40d924a, reversing
changes made to a10253cffea84c0c980a36ba6776b00ed96c3e3b.

A mismerge of a merge to catch up to main resulted in files being
committed which should not have been.


# 2e27230f 02-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: rewrite inpcb synchronization

Just trust the pcb database, that if we did in_pcbref(), no way
an inpcb can go away. And if we never put a dropped inpcb on
our queue, and tcp_discardcb() always removes an inpcb to be
dropped from the queue, then any inpcb on the queue is valid.

Now, to solve LOR between inpcb lock and HPTS queue lock do the
following trick. When we are about to process a certain time
slot, take the full queue of the head list into on stack list,
drop the HPTS lock and work on our queue. This of course opens
a race when an inpcb is being removed from the on stack queue,
which was already mentioned in comments. To address this race
introduce generation count into queues. If we want to remove
an inpcb with generation count mismatch, we can't do that, we
can only mark it with desired new time slot or -1 for remove.

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33026


# f971e791 02-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_hpts: rename input queue to drop queue and trim dead code

The HPTS input queue is in reality used only for "delayed drops".
When a TCP stack decides to drop a connection on the output path
it can't do that due to locking protocol between main tcp_output()
and stacks. So, rack/bbr utilize HPTS to drop the connection in
a different context.

In the past the queue could also process input packets in context
of HPTS thread, but now no stack uses this, so remove this
functionality.

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33025


# de2d4784 02-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

SMR protection for inpcbs

With introduction of epoch(9) synchronization to network stack the
inpcb database became protected by the network epoch together with
static network data (interfaces, addresses, etc). However, inpcb
aren't static in nature, they are created and destroyed all the
time, which creates some traffic on the epoch(9) garbage collector.

Fairly new feature of uma(9) - Safe Memory Reclamation allows to
safely free memory in page-sized batches, with virtually zero
overhead compared to uma_zfree(). However, unlike epoch(9), it
puts stricter requirement on the access to the protected memory,
needing the critical(9) section to access it. Details:

- The database is already build on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database SMR section is now required.
Once the desired inpcb is found we need to transition from SMR
section to r/w lock on the inpcb itself, with a check that inpcb
isn't yet freed. This requires some compexity, since SMR section
itself is a critical(9) section. The complexity is hidden from
KPI users in inp_smr_lock().
- For a inpcb list traversal (a pcblist sysctl, or broadcast
notification) also a new KPI is provided, that hides internals of
the database - inp_next(struct inp_iterator *).

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33022


# 147bf5e9 26-Nov-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: Don't try to upgrade a read lock just for logging

Reviewed by: glebius, lstewart, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D33098


# ff945008 18-Nov-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Add tcp_freecb() - single place to free tcpcb.

Until this change there were two places where we would free tcpcb -
tcp_discardcb() in case if all timers are drained and tcp_timer_discard()
otherwise. They were pretty much copy-n-paste, except that in the
default case we would run tcp_hc_update(). Merge this into single
function tcp_freecb() and move new short version of tcp_timer_discard()
to tcp_timer.c and make it static.

Reviewed by: rrs, hselasky
Differential revision: https://reviews.freebsd.org/D32965


# 2f62f92e 14-Nov-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: Fix a locking issue related to logging

tcp_respond() is sometimes called with only a read lock.
The logging however, requires a write lock. So either
try to upgrade the lock if needed, or don't log the packet.

Reported by: syzbot+8151ef969c170f76706b@syzkaller.appspotmail.com
Reported by: syzbot+eb679adb3304c511c1e4@syzkaller.appspotmail.com
Reviewed by: markj, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D32983


# 26cbd002 11-Nov-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Rack may still calculate long RTT on persists probes.

When a persists probe is lost, we will end up calculating a long
RTT based on the initial probe and when the response comes from the
second probe (or third etc). This means we have a minimum of a
confidence level of 3 on a incorrect probe. This commit will change it
so that we have one of two options
a) Just not count RTT of probes where we had a loss
<or>
b) Count them still but degrade the confidence to 0.

I have set in this the default being to just not measure them, but I am open
to having the default be otherwise.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32897


# b8d60729 11-Nov-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Congestion control cleanup.

NOTE: HEADS UP read the note below if your kernel config is not including GENERIC!!

This patch does a bit of cleanup on TCP congestion control modules. There were some rather
interesting surprises that one could get i.e. where you use a socket option to change
from one CC (say cc_cubic) to another CC (say cc_vegas) and you could in theory get
a memory failure and end up on cc_newreno. This is not what one would expect. The
new code fixes this by requiring a cc_data_sz() function so we can malloc with M_WAITOK
and pass in to the init function preallocated memory. The CC init is expected in this
case *not* to fail but if it does and a module does break the
"no fail with memory given" contract we do fall back to the CC that was in place at the time.

This also fixes up a set of common newreno utilities that can be shared amongst other
CC modules instead of the other CC modules reaching into newreno and executing
what they think is a "common and understood" function. Lets put these functions in
cc.c and that way we have a common place that is easily findable by future developers or
bug fixers. This also allows newreno to evolve and grow support for its features i.e. ABE
and HYSTART++ without having to dance through hoops for other CC modules, instead
both newreno and the other modules just call into the common functions if they desire
that behavior or roll there own if that makes more sense.

Note: This commit changes the kernel configuration!! If you are not using GENERIC in
some form you must add a CC module option (one of CC_NEWRENO, CC_VEGAS, CC_CUBIC,
CC_CDG, CC_CHD, CC_DCTCP, CC_HTCP, CC_HD). You can have more than one defined
as well if you desire. Note that if you create a kernel configuration that does not
define a congestion control module and includes INET or INET6 the kernel compile will
break. Also you need to define a default, generic adds 'options CC_DEFAULT=\"newreno\"
but you can specify any string that represents the name of the CC module (same names
that show up in the CC module list under net.inet.tcp.cc). If you fail to add the
options CC_DEFAULT in your kernel configuration the kernel build will also break.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
RELNOTES:YES
Differential Revision: https://reviews.freebsd.org/D32693


# e3ba94d4 09-Nov-2021 John Baldwin <jhb@FreeBSD.org>

Don't require the socket lock for sorele().

Previously, sorele() always required the socket lock and dropped the
lock if the released reference was not the last reference. Many
callers locked the socket lock just before calling sorele() resulting
in a wasted lock/unlock when not dropping the last reference.

Move the previous implementation of sorele() into a new
sorele_locked() function and use it instead of sorele() for various
places in uipc_socket.c that called sorele() while already holding the
socket lock.

The sorele() macro now uses refcount_release_if_not_last() try to drop
the socket reference without locking the socket. If that shortcut
fails, it locks the socket and calls sorele_locked().

Reviewed by: kib, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32741


# bb91496a 03-Nov-2021 Gordon Bergling <gbe@FreeBSD.org>

netinet: Fix a common typo in source code comments

- s/writting/writing/

MFC after: 3 days


# f581a26e 25-Oct-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP.

Pass control for IP/IP6 level options from generic tcp_ctloutput_set()
down to per-stack ctloutput.

Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput().

Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D32655


# e2833083 25-Oct-2021 Peter Lei <peterlei@netflix.com>

tcp: socket option to get stack alias name

TCP stack sysctl nodes are currently inserted using the stack
name alias. Allow the user to get the current stack's alias to
allow for programatic sysctl access.

Obtained from: Netflix


# a36230f7 01-Oct-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Make dsack stats available in netstat and also make sure its aware of TLP's.

DSACK accounting has been for quite some time under a NETFLIX_STATS ifdef. Statistics
on DSACKs however are very useful in figuring out how much bad retransmissions you
are doing. This is further complicated, however, by stacks that do TLP. A TLP
when discovering a lost ack in the reverse path will cause the generation
of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well
as the more traditional dsack-bytes and dsack-packets. These will now
all display in netstat -p tcp -s. This also updates all stacks that
are currently built to keep track of these stats.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32158


# ca1a7e10 12-Jul-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: TCP_LRO getting bad checksums and sending it in to TCP incorrectly.

In reviewing tcp_lro.c we have a possibility that some drives may send a mbuf into
LRO without making sure that the checksum passes. Some drivers actually are
aware of this and do not call lro when the csum failed, others do not do this and
thus could end up sending data up that we think has a checksum passing when
it does not.

This change will fix that situation by properly verifying that the mbuf
has the correct markings (CSUM VALID bits as well as csum in mbuf header
is set to 0xffff).

Reviewed by: tuexen, hselasky, gallatin
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31155


# 870af3f4 11-Jun-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: tolerate missing timestamps

Some TCP stacks negotiate TS support, but do not send TS at all
or not for keep-alive segments. Since this includes modern widely
deployed stacks, tolerate the violation of RFC 7323 per default.

Reviewed by: rgrimes, rrs, rscheff
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D30740
Sponsored by: Netflix, Inc.


# f4bb1869 14-Jun-2021 Mark Johnston <markj@FreeBSD.org>

Consistently use the SOLISTENING() macro

Some code was using it already, but in many places we were testing
SO_ACCEPTCONN directly. As a small step towards fixing some bugs
involving synchronization with listen(2), make the kernel consistently
use SOLISTENING(). No functional change intended.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 631449d5 24-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: Fix an issue with the PUSH bit as well as fill in the missing mtu change for fsb's

The push bit itself was also not actually being properly moved to
the right edge. The FIN bit was incorrectly on the left edge. We
fix these two issues as well as plumb in the mtu_change for
alternate stacks.

Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30413


# 0471a8c7 10-May-2021 Richard Scheffenegger <rscheff@FreeBSD.org>

tcp: SACK Lost Retransmission Detection (LRD)

Recover from excessive losses without reverting to a
retransmission timeout (RTO). Disabled by default, enable
with sysctl net.inet.tcp.do_lrd=1

Reviewed By: #transport, rrs, tuexen, #manpages
Sponsored by: Netapp, Inc.
Differential Revision: https://reviews.freebsd.org/D28931


# 9867224b 10-May-2021 Randall Stewart <rrs@FreeBSD.org>

tcp:Host cache and rack ending up with incorrect values.

The hostcache up to now as been updated in the discard callback
but without checking if we are all done (the race where there are
more than one calls and the counter has not yet reached zero). This
means that when the race occurs, we end up calling the hc_upate
more than once. Also alternate stacks can keep there srtt/rttvar
in different formats (example rack keeps its values in microseconds).
Since we call the hc_update *before* the stack fini() then the
values will be in the wrong format.

Rack on the other hand, needs to convert items pulled from the
hostcache into its internal format else it may end up with
very much incorrect values from the hostcache. In the process
lets commonize the update mechanism for srtt/rttvar since we
now have more than one place that needs to call it.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30172


# 5d8fd932 06-May-2021 Randall Stewart <rrs@FreeBSD.org>

This brings into sync FreeBSD with the netflix versions of rack and bbr.
This fixes several breakages (panics) since the tcp_lro code was
committed that have been reported. Quite a few new features are
now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the
largest). There is also support for ack-war prevention. Documents
comming soon on rack..

Sponsored by: Netflix
Reviewed by: rscheff, mtuexen
Differential Revision: https://reviews.freebsd.org/D30036


# 01d74fe1 12-Apr-2021 Navdeep Parhar <np@FreeBSD.org>

Path MTU discovery hooks for offloaded TCP connections.

Notify the TOE driver when when an ICMP type 3 code 4 (Fragmentation
needed and DF set) message is received for an offloaded connection.
This gives the driver an opportunity to lower the path MTU for the
connection and resume transmission, much like what the kernel does for
the connections that it handles.

Reviewed by: glebius@
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29755


# 9ca874cf 30-Mar-2021 Hans Petter Selasky <hselasky@FreeBSD.org>

Add TCP LRO support for VLAN and VxLAN.

This change makes the TCP LRO code more generic and flexible with regards
to supporting multiple different TCP encapsulation protocols and in general
lays the ground for broader TCP LRO support. The main job of the TCP LRO code is
to merge TCP packets for the same flow, to reduce the number of calls to upper
layers. This reduces CPU and increases performance, due to being able to send
larger TSO offloaded data chunks at a time. Basically the TCP LRO makes it
possible to avoid per-packet interaction by the host CPU.

Because the current TCP LRO code was tightly bound and optimized for TCP/IP
over ethernet only, several larger changes were needed. Also a minor bug was
fixed in the flushing mechanism for inactive entries, where the expire time,
"le->mtime" was not always properly set.

To avoid having to re-run time consuming regression tests for every change,
it was chosen to squash the following list of changes into a single commit:
- Refactor parsing of all address information into the "lro_parser" structure.
This easily allows to reuse parsing code for inner headers.
- Speedup header data comparison. Don't compare field by field, but
instead use an unsigned long array, where the fields get packed.
- Refactor the IPv4/TCP/UDP checksum computations, so that they may be computed
recursivly, only applying deltas as the result of updating payload data.
- Make smaller inline functions doing one operation at a time instead of
big functions having repeated code.
- Refactor the TCP ACK compression code to only execute once
per TCP LRO flush. This gives a minor performance improvement and
keeps the code simple.
- Use sbintime() for all time-keeping. This change also fixes flushing
of inactive entries.
- Try to shrink the size of the LRO entry, because it is frequently zeroed.
- Removed unused TCP LRO macros.
- Cleanup unused TCP LRO statistics counters while at it.
- Try to use __predict_true() and predict_false() to optimise CPU branch
predictions.

Bump the __FreeBSD_version due to changing the "lro_ctrl" structure.

Tested by: Netflix
Reviewed by: rrs (transport)
Differential Revision: https://reviews.freebsd.org/D29564
MFC after: 2 week
Sponsored by: Mellanox Technologies // NVIDIA Networking


# 9e644c23 18-Apr-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add support for TCP over UDP

Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.

Reviewed by: rrs
Sponsored by: Netflix, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29469


# 86046cf5 16-Apr-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_respond(): fix assertion, should have been done in 08d9c920275.


# 08d9c920 18-Mar-2021 Gleb Smirnoff <glebius@FreeBSD.org>

tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets

When packet is a SYN packet, we don't need to modify any existing PCB.
Normally SYN arrives on a listening socket, we either create a syncache
entry or generate syncookie, but we don't modify anything with the
listening socket or associated PCB. Thus create a new PCB lookup
mode - rlock if listening. This removes the primary contention point
under SYN flood - the listening socket PCB.

Sidenote: when SYN arrives on a synchronized connection, we still
don't need write access to PCB to send a challenge ACK or just to
drop. There is only one exclusion - tcptw recycling. However,
existing entanglement of tcp_input + stacks doesn't allow to make
this change small. Consider this patch as first approach to the problem.

Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D29576


# 69a34e8d 26-Jan-2021 Randall Stewart <rrs@FreeBSD.org>

Update the LRO processing code so that we can support
a further CPU enhancements for compressed acks. These
are acks that are compressed into an mbuf. The transport
has to be aware of how to process these, and an upcoming
update to rack will do so. You need the rack changes
to actually test and validate these since if the transport
does not support mbuf compression, then the old code paths
stay in place. We do in this commit take out the concept
of logging if you don't have a lock (which was quite
dangerous and was only for some early debugging but has
been left in the code).

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D28374


# bc7ee8e5 19-Jan-2021 Richard Scheffenegger <srichard@netapp.com>

Address panic with PRR due to missed initialization of recover_fs

Summary:
When using the base stack in conjunction with RACK, it appears that
infrequently, ++tp->t_dupacks is instantly larger than tcprexmtthresh.

This leaves the recover flightsize (sackhint.recover_fs) uninitialized,
leading to a div/0 panic.

Address this by properly initializing the variable just prior to first
use, if it is not properly initialized.

In order to prevent stale information from a prior recovery to
negatively impact the PRR calculations in this event, also clear
recover_fs once loss recovery is finished.

Finally, improve the readability of the initialization of recover_fs
when t_dupacks == tcprexmtthresh by adjusting the indentation and
using the max(1, snd_nxt - snd_una) macro.

Reviewers: rrs, kbowling, tuexen, jtl, #transport, gnn!, jmg, manu, #manpages

Reviewed By: rrs, kbowling, #transport

Subscribers: bdrewery, andrew, rpokala, ae, emaste, bz, bcran, #linuxkpi, imp, melifaro

Differential Revision: https://reviews.freebsd.org/D28114


# d2b3cedd 13-Jan-2021 Michael Tuexen <tuexen@FreeBSD.org>

tcp: add sysctl to tolerate TCP segments missing timestamps

When timestamp support has been negotiated, TCP segements received
without a timestamp should be discarded. However, there are broken
TCP implementations (for example, stacks used by Omniswitch 63xx and
64xx models), which send TCP segments without timestamps although
they negotiated timestamp support.
This patch adds a sysctl variable which tolerates such TCP segments
and allows to interoperate with broken stacks.

Reviewed by: jtl@, rscheff@
Differential Revision: https://reviews.freebsd.org/D28142
Sponsored by: Netflix, Inc.
PR: 252449
MFC after: 1 week


# ce398115 28-Oct-2020 John Baldwin <jhb@FreeBSD.org>

Save the current TCP pacing rate in t_pacing_rate.

Reviewed by: gallatin, gnn
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26875


# 54321200 09-Oct-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Extend netstat to display TCP stack and detailed congestion state (2)

Extend netstat to display TCP stack and detailed congestion state

Adding the "-c" option used to show detailed per-connection
congestion control state for TCP sessions.

This is one summary patch, which adds the relevant variables into
xtcpcb. As previous "spare" space is used, these changes are ABI
compatible.

Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26518


# e3995661 25-Sep-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

TCP: send full initial window when timestamps are in use

The fastpath in tcp_output tries to send out
full segments, and avoid sending partial segments by
comparing against the static t_maxseg variable.
That value does not consider tcp options like timestamps,
while the initial window calculation is using
the correct dynamic tcp_maxseg() function.

Due to this interaction, the last, full size segment
is considered too short and not sent out immediately.

Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26478


# 42d75607 13-Sep-2020 Michael Tuexen <tuexen@FreeBSD.org>

Export the name of the congestion control. This will be used by sockstat
and netstat.

Reviewed by: rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D26412


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# 8315f1ea 31-Jul-2020 Randall Stewart <rrs@FreeBSD.org>

The recent changes to move the ref count increment
back from the end of the function created an issue.
If one of the routines returns NULL during setup
we have inp's with extra references (which is why
the increment was at the end).

Also the stack switch return code was being ignored
and actually has meaning if the stack cannot take over
it should return NULL.

Fix both of these situation by being sure to test the
return code and of course in any case of return NULL (there
are 3) make sure we properly reduce the ref count.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D25903


# c201ce0b 06-Jul-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Fix KASSERT during tcp_newtcpcb when low on memory

While testing with system default cc set to cubic, and
running a memory exhaustion validation, FreeBSD panics for a
missing inpcb reference / lock.

Reviewed by: rgrimes (mentor), tuexen (mentor)
Approved by: rgrimes (mentor), tuexen (mentor)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D25583


# a37a5246 28-May-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Use fib[46]_lookup() in mtu calculations.

fib[46]_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.

Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.

Differential Revision: https://reviews.freebsd.org/D24974


# e570d231 27-Apr-2020 Randall Stewart <rrs@FreeBSD.org>

This change does a small prepratory step in getting the
latest rack and bbr in from the NF repo. When those come
in the OOB data handling will be fixed where Skyzaller crashes.

Differential Revision: https://reviews.freebsd.org/D24575


# 983066f0 25-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert route caching to nexthop caching.

This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
to perform packet forwarding such as gateway interface and mtu. Nexthops
are shared among the routes, providing more pre-computed cache-efficient
data while requiring less memory. Splitting the LPM code and the attached
data solves multiple long-standing problems in the routing layer,
drastically reduces the coupling with outher parts of the stack and allows
to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
for notably better performance for large TCP senders. Caching works by
acquiring rtentry reference, which is protected by per-rtentry mutex.
If the routing table is changed (checked by comparing the rtable generation id)
or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
cached object, which is mostly mechanical. Other moving parts like cache
cleanup on rtable change remains the same.

Differential Revision: https://reviews.freebsd.org/D24340


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 481be5de 12-Feb-2020 Randall Stewart <rrs@FreeBSD.org>

White space cleanup -- remove trailing tab's or spaces
from any line.

Sponsored by: Netflix Inc.


# 0452a1f3 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Add documenting NET_EPOCH_ASSERT() to tcp_drop().


# bab98355 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Add some documenting NET_EPOCH_ASSERTs.


# 4c69f60a 13-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Fix yet another regression from r354484. Error code from cr_cansee()
aliases with hard error from other operations.

Reported by: flo


# 334fc582 08-Jan-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

vnet: virtualise more network stack sysctls.

Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are
set in the netoptions startup script, which we would love to run for VNETs
as well [1].

While virtualising the log_in_vain sysctls seems pointles at first for as
long as the kernel message buffer is not virtualised, it at least allows
an administrator to debug the base system or an individual jail if needed
without turning the logging on for all jails running on a system.

PR: 243193 [1]
MFC after: 2 weeks


# 1cf55767 17-Dec-2019 Randall Stewart <rrs@FreeBSD.org>

This commit is a bit of a re-arrange of deck chairs. It
gets both rack and bbr ready for the completion of the STATs
framework in FreeBSD. For now if you don't have both NF_stats and
stats on it disables them. As soon as the rest of the stats framework
lands we can remove that restriction and then just uses stats when
defined.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D22479


# e5a084d0 04-Dec-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Fix regression from r354484. Don't leak pcb lock if cr_canseeinpcb()
returns non-zero.

PR: 242415


# adc56f5a 02-Dec-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Make use of the stats(3) framework in the TCP stack.

This makes it possible to retrieve per-connection statistical
information such as the receive window size, RTT, or goodput,
using a newly added TCP_STATS getsockopt(3) option, and extract
them using the stats_voistat_fetch(3) API.

See the net/tcprtt port for an example consumer of this API.

Compared to the existing TCP_INFO system, the main differences
are that this mechanism is easy to extend without breaking ABI,
and provides statistical information instead of raw "snapshots"
of values at a given point in time. stats(3) is more generic
and can be used in both userland and the kernel.

Reviewed by: thj
Tested by: thj
Obtained from: Netflix
Relnotes: yes
Sponsored by: Klara Inc, Netflix
Differential Revision: https://reviews.freebsd.org/D20655


# 032677ce 07-Nov-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Now that there is no R/W lock on PCB list the pcblist sysctls
handlers can be greatly simplified. All the previous double
cycling and complex locking was added to avoid these functions
holding global PCB locks for extended period of time, preventing
addition of new entries.


# d797164a 07-Nov-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Since r353292 on input path we are always in network epoch, when
we lookup PCBs. Thus, do not enter epoch recursively in
in_pcblookup_hash() and in6_pcblookup_hash(). Same applies to
tcp_ctlinput() and tcp6_ctlinput().

This leaves several sysctl(9) handlers that return PCB credentials
unprotected. Add epoch enter/exit to all of them.

Differential Revision: https://reviews.freebsd.org/D22197


# 1a496125 06-Nov-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER().
Remove few outdated comments and extraneous assertions. No
functional change here.


# 71e85612 28-Sep-2019 Michael Tuexen <tuexen@FreeBSD.org>

Replacing MD5 by SipHash improves the performance of the TCP time stamp
initialisation, which is important when the host is dealing with a
SYN flood.
This affects the computation of the initial TCP sequence number for
the client side.
This has been discussed with secteam@.

Reviewed by: gallatin@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21616


# af9b9e0d 06-Sep-2019 Randall Stewart <rrs@FreeBSD.org>

This adds in the missing counter initialization which
I had forgotten to bring over.. opps.

Differential Revision: https://reviews.freebsd.org/D21127


# b2e60773 26-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277


# d21036e0 23-Jul-2019 Michael Tuexen <tuexen@FreeBSD.org>

Add a sysctl variable ts_offset_per_conn to change the computation
of the TCP TS offset from taking the IP addresses and the TCP port
numbers into account to a version just taking only the IP addresses
into account. This works around broken middleboxes or endpoints.
The default is to keep the behaviour, which is also the behaviour
recommended in RFC 7323.

Reported by: devgs@ukr.net
Reviewed by: rrs@
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D20980


# 6b69072a 27-Jun-2019 John Baldwin <jhb@FreeBSD.org>

Reject attempts to register a TCP stack being unloaded.

Reviewed by: gallatin
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20617


# 0999766d 23-Mar-2019 Michael Tuexen <tuexen@FreeBSD.org>

Add sysctl variable net.inet.tcp.rexmit_initial for setting RTO.Initial
used by TCP.

Reviewed by: rrs@, 0mp@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D19355


# dbcc2000 27-Feb-2019 John Baldwin <jhb@FreeBSD.org>

Various cleanups to the management of multiple TCP stacks.

- Use strlcpy() with sizeof() instead of strncpy().

- Simplify initialization of TCP functions structures.

init_tcp_functions() was already called before the first call to
register a stack. Just inline the work in the SYSINIT and remove
the racy helper variable. Instead, KASSERT that the rw lock is
initialized when registering a stack.

- Protect the default stack via a direct pointer comparison.

The default stack uses the name "freebsd" instead of "default" so
this protection wasn't working for the default stack anyway.

Reviewed by: rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19152


# 79db6fe7 22-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Plug some networking sysctl leaks.

Various network protocol sysctl handlers were not zero-filling their
output buffers and thus would export uninitialized stack memory to
userland. Fix a number of such handlers.

Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: tuexen
MFC after: 3 days
Security: kernel memory disclosure
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18301


# 90ab3571 23-Aug-2018 Michael Tuexen <tuexen@FreeBSD.org>

Use arc4rand() instead of read_random() in the SCTP and TCP code.

This was suggested by jmg@.

Reviewed by: delphij@, jmg@, jtl@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16860


# 4ba1513d 23-Aug-2018 Michael Tuexen <tuexen@FreeBSD.org>

Don't use the explicit number 32 for the length of the secrets,
use sizeof() or explicit #definesi instead. No functional change.
This was suggested by jmg@.

MFC after: 1 month
XMFC with: r338053
Sponsored by: Netflix, Inc.


# 5dff1c38 21-Aug-2018 Michael Tuexen <tuexen@FreeBSD.org>

Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP
socket resulted in sending fragmented IPV6 packets.

This is fixes by reducing the MSS to the appropriate value. In addtion,
if the socket option is set before the handshake happens, announce this
MSS to the peer. This is not stricly required, but done since TCP
is conservative.

PR: 173444
Reviewed by: bz@, rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16796


# c28440db 19-Aug-2018 Randall Stewart <rrs@FreeBSD.org>

This change represents a substantial restructure of the way we
reassembly inbound tcp segments. The old algorithm just blindly
dropped in segments without coalescing. This meant that every
segment could take up greater and greater room on the linked list
of segments. This of course is now subject to a tighter limit (100)
of segments which in a high BDP situation will cause us to be a
lot more in-efficent as we drop segments beyond 100 entries that
we receive. What this restructure does is cause the reassembly
buffer to coalesce segments putting an emphasis on the two
common cases (which avoid walking the list of segments) i.e.
where we add to the back of the queue of segments and where we
add to the front. We also have the reassembly buffer supporting
a couple of debug options (black box logging as well as counters
for code coverage). These are compiled out by default but can
be added by uncommenting the defines.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D16626


# 8e02b4e0 19-Aug-2018 Michael Tuexen <tuexen@FreeBSD.org>

Don't expose the uptime via the TCP timestamps.

The TCP client side or the TCP server side when not using SYN-cookies
used the uptime as the TCP timestamp value. This patch uses in all
cases an offset, which is the result of a keyed hash function taking
the source and destination addresses and port numbers into account.
The keyed hash function is the same a used for the initial TSN.

Reviewed by: rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16636


# 5f901c92 24-Jul-2018 Andrew Turner <andrew@FreeBSD.org>

Use the new VNET_DEFINE_STATIC macro when we are defining static VNET
variables.

Reviewed by: bz
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16147


# 22699887 21-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

NULL out cc_data in pluggable TCP {cc}_cb_destroy

When ABE was added (rS331214) to NewReno and leak fixed (rS333699) , it now has
a destructor (newreno_cb_destroy) for per connection state. Other congestion
controls may allocate and free cc_data on entry and exit, but the field is
never explicitly NULLed if moving back to NewReno which only internally
allocates stateful data (no entry contstructor) resulting in a situation where
newreno_cb_destory might be called on a junk pointer.

- NULL out cc_data in the framework after calling {cc}_cb_destroy
- free(9) checks for NULL so there is no need to perform not NULL checks
before calling free.
- Improve a comment about NewReno in tcp_ccalgounload

This is the result of a debugging session from Jason Wolfe, Jason Eggleston,
and mmacy@ and very helpful insight from lstewart@.

Submitted by: Kevin Bowling
Reviewed by: lstewart
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16282


# 6573d758 03-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

epoch(9): allow preemptible epochs to compose

- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066


# 99208b82 01-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

inpcb: don't gratuitously defer frees

Don't defer frees in sysctl handlers. It isn't necessary
and it just confuses things.
revert: r333911, r334104, and r334125

Requested by: jtl


# b8ab6593 27-Jun-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Check the inp_flags under inp lock. Looks like the race was hidden
before, the conversion of tcbinfo to CK_LIST have uncovered it.


# b872626d 12-Jun-2018 Matt Macy <mmacy@FreeBSD.org>

mechanical CK macro conversion of inpcbinfo lists

This is a dependency for converting the inpcbinfo hash and info rlocks
to epoch.


# fe524329 23-May-2018 Matt Macy <mmacy@FreeBSD.org>

convert allocations to INVARIANTS M_ZERO


# 246a6199 23-May-2018 Matt Macy <mmacy@FreeBSD.org>

epoch: allow for conditionally asserting that the epoch context fields
are unused by zeroing on INVARIANTS builds


# 056b40e2 19-May-2018 Matt Macy <mmacy@FreeBSD.org>

inpcb: consolidate possible deletion in pcblist functions in to epoch
deferred context.


# 7a3c5b05 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

tcp sysctl fix may be uninitialized


# 7875017c 24-Apr-2018 Sean Bruno <sbruno@FreeBSD.org>

Revert r332894 at the request of the submitter.

Submitted by: Johannes Lundberg <johalun0_gmail.com>
Sponsored by: Limelight Networks


# 7b7796ee 23-Apr-2018 Sean Bruno <sbruno@FreeBSD.org>

Load balance sockets with new SO_REUSEPORT_LB option

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).

Submitted by: Johannes Lundberg <johanlun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003


# fd389e7c 19-Apr-2018 Randall Stewart <rrs@FreeBSD.org>

These two modules need the tcp_hpts.h file for
when the option is enabled (not sure how LINT/build-universe
missed this) opps.

Sponsored by: Netflix Inc


# 3ee9c3c4 19-Apr-2018 Randall Stewart <rrs@FreeBSD.org>

This commit brings in the TCP high precision timer system (tcp_hpts).
It is the forerunner/foundational work of bringing in both Rack and BBR
which use hpts for pacing out packets. The feature is optional and requires
the TCPHPTS option to be enabled before the feature will be active. TCP
modules that use it must assure that the base component is compile in
the kernel in which they are loaded.

MFC after: Never
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15020


# 8b8718b5 10-Apr-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Modify the net.inet.tcp.function_ids sysctl introduced in r331347.
Export additional information which may be helpful to userspace
consumers and rename the sysctl to net.inet.tcp.function_info.

Sponsored by: Netflix, Inc.


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# e24e5683 23-Mar-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Make the TCP blackbox code committed in r331347 be an optional feature
controlled by the TCP_BLACKBOX option.

Enable this as part of amd64 GENERIC. For now, leave it disabled on
other platforms.

Sponsored by: Netflix, Inc.


# 2529f56e 22-Mar-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Add the "TCP Blackbox Recorder" which we discussed at the developer
summits at BSDCan and BSDCam in 2017.

The TCP Blackbox Recorder allows you to capture events on a TCP connection
in a ring buffer. It stores metadata with the event. It optionally stores
the TCP header associated with an event (if the event is associated with a
packet) and also optionally stores information on the sockets.

It supports setting a log ID on a TCP connection and using this to correlate
multiple connections that share a common log ID.

You can log connections in different modes. If you are doing a coordinated
test with a particular connection, you may tell the system to put it in
mode 4 (continuous dump). Or, if you just want to monitor for errors, you
can put it in mode 1 (ring buffer) and dump all the ring buffers associated
with the connection ID when we receive an error signal for that connection
ID. You can set a default mode that will be applied to a particular ratio
of incoming connections. You can also manually set a mode using a socket
option.

This commit includes only basic probes. rrs@ has added quite an abundance
of probes in his TCP development work. He plans to commit those soon.

There are user-space programs which we plan to commit as ports. These read
the data from the log device and output pcapng files, and then let you
analyze the data (and metadata) in the pcapng files.

Reviewed by: gnn (previous version)
Obtained from: Netflix, Inc.
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D11085


# 18a75309 25-Feb-2018 Patrick Kelsey <pkelsey@FreeBSD.org>

Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option.

The conditional compilation support is now centralized in
tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum
theoretical code/data footprint when TCP_RFC7413 is disabled, but
nearly all the TFO code should wind up being removed by the optimizer,
the additional footprint in the syncache entries is a single pointer,
and the additional overhead in the tcpcb is at the end of the
structure.

This enables the TCP_RFC7413 kernel option by default in amd64 and
arm64 GENERIC.

Reviewed by: hiren
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14048


# c560df6f 25-Feb-2018 Patrick Kelsey <pkelsey@FreeBSD.org>

This is an implementation of the client side of TCP Fast Open (TFO)
[RFC7413]. It also includes a pre-shared key mode of operation in
which the server requires the client to be in possession of a shared
secret in order to successfully open TFO connections with that server.

The names of some existing fastopen sysctls have changed (e.g.,
net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable).

Reviewed by: tuexen
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14047


# 7a5c7305 22-Nov-2017 Mark Johnston <markj@FreeBSD.org>

Use the right variable for the IP header parameter to tcp:::send.

This addresses a regression from r311225.

MFC after: 1 week


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 779f106a 08-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Listening sockets improvements.

o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().

o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.

o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.

o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770


# dc6a41b9 08-Jun-2017 Jonathan T. Looney <jtl@FreeBSD.org>

Add the infrastructure to support loading multiple versions of TCP
stack modules.

It adds support for mangling symbols exported by a module by prepending
a string to them. (This avoids overlapping symbols in the kernel linker.)

It allows the use of a macro as the module name in the DECLARE_MACRO()
and MACRO_VERSION() macros.

It allows the code to register stack aliases (e.g. both a generic name
["default"] and version-specific name ["default_10_3p1"]).

With these changes, it is trivial to compile TCP stack modules with
the name defined in the Makefile and to load multiple versions of the
same stack simultaneously. This functionality can be used to enable
side-by-side testing of an old and new version of the same TCP stack.
It also could support upgrading the TCP stack without a reboot.

Reviewed by: gnn, sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D11086


# 8cb5a8e9 03-Jun-2017 Michael Tuexen <tuexen@FreeBSD.org>

Fix the ICMP6 handling for TCP.

The ICMP6 packets might not be contained in a single mbuf. So don't
assume this. Keep the IPv4 and IPv6 code in sync and make explicit
that the syncache code only need the TCP sequence number, not the
complete TCP header.

MFC after: 3 days
Sponsored by: Netflix, Inc.


# cc487c16 15-May-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Reduce in_pcbinfo_init() by two params. No users supply any flags to this
function (they used to say UMA_ZONE_NOFREE), so flag parameter goes away.
The zone_fini parameter also goes away. Previously no protocols (except
divert) supplied zone_fini function, so inpcb locks were leaked with slabs.
This was okay while zones were allocated with UMA_ZONE_NOFREE flag, but now
this is a leak. Fix that by suppling inpcb_fini() function as fini method
for all inpcb zones.


# cc65eb4e 21-Mar-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Hide struct inpcb, struct tcpcb from the userland.

This is a painful change, but it is needed. On the one hand, we avoid
modifying them, and this slows down some ideas, on the other hand we still
eventually modify them and tools like netstat(1) never work on next version of
FreeBSD. We maintain a ton of spares in them, and we already got some ifdef
hell at the end of tcpcb.

Details:
- Hide struct inpcb, struct tcpcb under _KERNEL || _WANT_FOO.
- Make struct xinpcb, struct xtcpcb pure API structures, not including
kernel structures inpcb and tcpcb inside. Export into these structures
the fields from inpcb and tcpcb that are known to be used, and put there
a ton of spare space.
- Make kernel and userland utilities compilable after these changes.
- Bump __FreeBSD_version.

Reviewed by: rrs, gnn
Differential Revision: D10018


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# fcf59617 06-Feb-2017 Andrey V. Elsukov <ae@FreeBSD.org>

Merge projects/ipsec into head/.

Small summary
-------------

o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
option IPSEC_SUPPORT added. It enables support for loading
and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
support was removed. Added TCP/UDP checksum handling for
inbound packets that were decapsulated by transport mode SAs.
setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
build as part of ipsec.ko module (or with IPSEC kernel).
It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
methods. The only one header file <netipsec/ipsec_support.h>
should be included to declare all the needed things to work
with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
- now all security associations stored in the single SPI namespace,
and all SAs MUST have unique SPI.
- several hash tables added to speed up lookups in SADB.
- SADB now uses rmlock to protect access, and concurrent threads
can do SA lookups in the same time.
- many PF_KEY message handlers were reworked to reflect changes
in SADB.
- SADB_UPDATE message was extended to support new PF_KEY headers:
SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
avoid locking protection for ipsecrequest. Now we support
only limited number (4) of bundled SAs, but they are supported
for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
check for full history of applied IPsec transforms.
o References counting rules for security policies and security
associations were changed. The proper SA locking added into xform
code.
o xform code was also changed. Now it is possible to unregister xforms.
tdb_xxx structures were changed and renamed to reflect changes in
SADB/SPDB, and changed rules for locking and refcounting.

Reviewed by: gnn, wblock
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9352


# ec93ed8d 03-Feb-2017 Patrick Kelsey <pkelsey@FreeBSD.org>

Fix VIMAGE-related bugs in TFO. The autokey callout vnet context was
not being initialized, and the per-vnet fastopen context was only
being initialized for the default vnet.

PR: 216613
Reported by: Alex Deiter <alex dot deiter at gmail dot com>
MFC after: 1 week


# 2b9c9984 03-Jan-2017 George V. Neville-Neil <gnn@FreeBSD.org>

Fix DTrace TCP tracepoints to not use mtod() as it is both unnecessary and
dangerous. Those wanting data from an mbuf should use DTrace itself to get
the data.

PR: 203409
Reviewed by: hiren
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D9035


# 35dfb8cb 19-Nov-2016 Michael Tuexen <tuexen@FreeBSD.org>

Ensure that TCP state changes to state-closing are reported via dtrace.
This does not cover state changes from TIME-WAIT.

Reviewed by: gnn
MFC after: 3 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8443


# 8432fa5f 05-Nov-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Initialize ip6 pointer before use.

PR: 214169
MFC after: 1 week


# 3e146575 21-Oct-2016 Michael Tuexen <tuexen@FreeBSD.org>

Make ICMPv6 hard error handling for TCP consistent with the ICMPv4
handling. Ensure that:
* Protocol unreachable errors are handled by indicating ECONNREFUSED
to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Communication prohibited errors are handled by indicating ECONNREFUSED
to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Hop Limited exceeded errors are handled by indicating EHOSTUNREACH
to the TCP user for both IPv4 and IPv6.
For IPv6 the TCP connected was dropped but errno wasn't set.

Reviewed by: gallatin, rrs
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: 7904


# 82676a28 14-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

r307082 added the TCP_HHOOK kernel option and made some existing code only
compile when that option is configured. In tcp_destroy(), the error
variable is now only used in code enclosed in an '#ifdef TCP_HHOOK' block.
This broke the build for VNET images.

Enclose the error variable itself in an #ifdef block.

Submitted by: Shawn Webb <shawn.webb at hardenedbsd.org>
Reported by: Shawn Webb <shawn.webb at hardenedbsd.org>
PointyHat to: jtl


# bd79708d 11-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.

Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.

This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.

Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.

Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8185


# 3ac12506 06-Oct-2016 Jonathan T. Looney <jtl@FreeBSD.org>

Remove "long" variables from the TCP stack (not including the modular
congestion control framework).

Reviewed by: gnn, lstewart (partial)
Sponsored by: Juniper Networks, Netflix
Differential Revision: (multiple)
Tested by: Limelight, Netflix


# 587d67c0 16-Aug-2016 Randall Stewart <rrs@FreeBSD.org>

Here we update the modular tcp to be able to switch to an
alternate TCP stack in other then the closed state (pre-listen/connect).
The idea is that *if* that is supported by the alternate stack, it
is asked if its ok to switch. If it approves the "handoff" then we
allow the switch to happen. Also the fini() function now gets a flag
to tell if you are switching away *or* the tcb is destroyed. The
init() call into the alternate stack is moved to the end so the
tcb is more fully formed before the init transpires.

Sponsored by: Netflix Inc.
Differential Revision: D6790


# d4c22202 01-Aug-2016 Andrew Gallatin <gallatin@FreeBSD.org>

Rework IPV6 TCP path MTU discovery to match IPv4

- Re-write tcp_ctlinput6() to closely mimic the IPv4 tcp_ctlinput()

- Now that tcp_ctlinput6() updates t_maxseg, we can allow ip6_output()
to send TCP packets without looking at the tcp host cache for every
single transmit.

- Make the icmp6 code mimic the IPv4 code & avoid returning
PRC_HOSTDEAD because it is so expensive.

Without these changes in place, every TCP6 pmtu discovery or host
unreachable ICMP resulted in a call to in6_pcbnotify() which walks the
tcbinfo table with the write lock held. Because the tcbinfo table is
shared between IPv4 and IPv6, this causes huge scalabilty issues on
servers with lots of (~100K) TCP connections, to the point where even
a small percent of IPv6 traffic had a disproportionate impact on
overall throughput.

Reviewed by: bz, rrs, ae (all earlier versions), lstewart (in Netflix's tree)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D7272


# 0e3b8919 28-Jul-2016 Andrew Gallatin <gallatin@FreeBSD.org>

Call tcp_notify() directly to shoot down routes, rather than
calling in_pcbnotifyall().

This avoids lock contention on tcbinfo due to in_pcbnotifyall()
holding the tcbinfo write lock while walking all connections.

Reviewed by: rrs, karels
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D7251


# 24b9bb56 06-Jul-2016 Jonathan T. Looney <jtl@FreeBSD.org>

The TCPPCAP debugging feature caches recently-used mbufs for use in
debugging TCP connections. This commit provides a mechanism to free those
mbufs when the system is under memory pressure.

Because this will result in lost debugging information, the behavior is
controllable by a sysctl. The default setting is to free the mbufs.

Reviewed by: gnn
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D6931
Input from: novice_techie.com


# 42644eb8 23-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Try to avoid a 2nd conditional by re-writing the loop, pause, and
escape clause another time.

Submitted by: jhb
Approved by: re (gjb)
MFC after: 12 days


# 361279a4 23-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

In VNET TCP teardown Do not sleep unconditionally but only if we
have any TCP connections left.

Submitted by: zec
Approved by: re (hrs)
MFC after: 13 days


# b54e08e1 22-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Check the V_tcbinfo.ipi_count to hit 0 before doing the full TCP cleanup.
That way timers can finish cleanly and we do not gamble with a DELAY().

Reviewed by: gnn, jtl
Approved by: re (gjb)
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6923


# 3f58662d 01-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

The pr_destroy field does not allow us to run the teardown code in a
specific order. VNET_SYSUNINITs however are doing exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some order and to make the teardown functions
file local static.
Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use
internally.

Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.

Reviewed by: gnn, tuexen (sctp), jhb (kernel.h)
Obtained from: projects/vnet
MFC after: 2 weeks
X-MFC: do not remove pr_destroy
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6652


# 052a5418 26-May-2016 John Baldwin <jhb@FreeBSD.org>

Don't reuse the source mbuf in tcp_respond() if it is not writable.

Not all mbufs passed up from device drivers are M_WRITABLE(). In
particular, the Chelsio T4/T5 driver uses a feature called "buffer packing"
to receive multiple frames in a single receive buffer. The mbufs for
these frames all share the same external storage so are treated as
read-only by the rest of the stack when multiple frames are in flight.
Previously tcp_respond() would blindly overwrite read-only mbufs when
INVARIANTS was disabled or panic with an assertion failure if INVARIANTS
was enabled. Note that the new case is a bit of a mix of the two other
cases in tcp_respond(). The TCP and IP headers must be copied explicitly
into the new mbuf instead of being inherited (similar to the m == NULL
case), but the addresses and ports must be swapped in the reply (similar
to the m != NULL case).

Reviewed by: glebius


# f59d975e 17-May-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Tiny refactor of r294869/r296881: use defines to mask the VNET() macro.

Suggested by: bz


# a4641f4e 03-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/net*: minor spelling fixes.

No functional change.


# e5ad6456 28-Apr-2016 Randall Stewart <rrs@FreeBSD.org>

This cleans up the timers code in TCP to start using the new
async_drain functionality. This as been tested in NF as well as
by Verisign. Still to do in here is to remove all the old flags. They
are currently left being maintained but probably are no longer needed.

Sponsored by: Netflix Inc.
Differential Revision: http://reviews.freebsd.org/D5924


# 806929d5 08-Apr-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Mfp: r296310,r296343

It looks like as with the safety belt of DELAY() fastened (*) we can
completely tear down and free all memory for TCP (after r281599).

(*) in theory a few ticks should be good enough to make sure the timers
are all really gone. Could we use a better matric here and check a
tcbcb count as an optimization?

PR: 164763
Reviewed by: gnn, emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5734


# 8586a963 09-Apr-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Mfp: r296260

The tcp_inpcb (pcbinfo) zone should be safe to destroy.

PR: 164763
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5732


# f254aeda 09-Apr-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Mfp: r296259

We attach the "counter" to the tcpcbs. Thus don't free the
TCP Fastopen zone before the tcpcbs are gone, as otherwise
the zone won't be empty.
With that it should be safe to destroy the "tfo" zone without
leaking the memory.

PR: 164763
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5731


# 35030a5d 29-Mar-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove some NULL checks for M_WAITOK allocations.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 4f321dbd 24-Mar-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix compile errors after r297225:

- properly V_irtualise variable access unbreaking VIMAGE kernels.
- remove the volatile from the function return type to make architecture
using gcc happy [-Wreturn-type]
"type qualifiers ignored on function return type"
I am not entirely happy with this solution putting the u_int there
but it will do for now.


# 84cc0778 24-Mar-2016 George V. Neville-Neil <gnn@FreeBSD.org>

FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.

Submitted by: Mike Karels
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4306


# bf840a17 14-Mar-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Redo r294869. The array of counters for TCP states doesn't belong to
struct tcpstat, because the structure can be zeroed out by netstat(1) -z,
and of course running connection counts shouldn't be touched.

Place running connection counts into separate array, and provide
separate read-only sysctl oid for it.


# 737d4f6c 07-Mar-2016 Jonathan T. Looney <jtl@FreeBSD.org>

As reported on the transport@ and current@ mailing lists, the FreeBSD TCP
stack is not compliant with RFC 7323, which requires that TCP stacks send
a timestamp option on all packets (except, optionally, RSTs) after the
session is established.

This patch adds that support. It also adds a TCP signature option to the
packet, if appropriate.

PR: 206047
Differential Revision: https://reviews.freebsd.org/D4808
Reviewed by: hiren
MFC after: 2 weeks
Sponsored by: Juniper Networks


# 9cbade8f 07-Mar-2016 Jonathan T. Looney <jtl@FreeBSD.org>

Some cleanup in tcp_respond() in preparation for another change:
- Reorder variables by size
- Move initializer closer to where it is used
- Remove unneeded variable

Differential Revision: https://reviews.freebsd.org/D4808
Reviewed by: hiren
MFC after: 2 weeks
Sponsored by: Juniper Networks


# e79cb051 03-Mar-2016 George V. Neville-Neil <gnn@FreeBSD.org>

Fix dtrace probes (introduced in 287759): debug__input was used
for output and drop; connect didn't always fire a user probe
some probes were missing in fastpath

Submitted by: Hannes Mehnert
Sponsored by: REMS, EPSRC
Differential Revision: https://reviews.freebsd.org/D5525


# 6971a637 23-Feb-2016 Bryan Drewery <bdrewery@FreeBSD.org>

Fix build after r29592.


# 6e0efc6a 23-Feb-2016 Randall Stewart <rrs@FreeBSD.org>

This fixes the fastpath code to have a better module initialization sequence when
included in loader.conf. It also fixes it so that no matter if some one incorrectly
specifies a load order, the lists and such will be initialized on demand at that
time so no one can make that mistake.

Reviewed by: hiren
Differential Revision: D5189


# 4644fda3 27-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Rename netinet/tcp_cc.h to netinet/cc/cc.h.

Discussed with: lstewart


# 75dd79d9 26-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Grab a snap amount of TCP connections in syncache from tcpstat.


# 57a78e3b 26-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Augment struct tcpstat with tcps_states[], which is used for book-keeping
the amount of TCP connections by state. Provides a cheap way to get
connection count without traversing the whole pcb list.

Sponsored by: Netflix


# 0645c604 26-Jan-2016 Hiren Panchasara <hiren@FreeBSD.org>

Persist timers TCPTV_PERSMIN and TCPTV_PERSMAX are hardcoded with 5 seconds and
60 seconds, respectively. Turn them into sysctls that can be tuned live. The
default values of 5 seconds and 60 seconds have been retained.

Submitted by: Jason Wolfe (j at nitrology dot com)
Reviewed by: gnn, rrs, hiren, bz
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D5024


# 0d6a516e 25-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert TCP mtu checks to the new routing KPI.


# 8bdb5261 22-Jan-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Correct function arguments for SYSUNINITs.

Obtained from: p4 @180885
Sponsored by: The FreeBSD Foundation


# 1f12da0e 22-Jan-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Just checkpoint the WIP in order to be able to make the tree update
easier. Note: this is currently not in a usable state as certain
teardown parts are not called and the DOMAIN rework is missing.
More to come soon and find its way to head.

Obtained from: P4 //depot/user/bz/vimage/...
Sponsored by: The FreeBSD Foundation


# 2de3e790 21-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

- Rename cc.h to more meaningful tcp_cc.h.
- Declare it a kernel only include, which it already is.
- Don't include tcp.h implicitly from tcp_cc.h


# ea8d1492 09-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove sys/eventhandler.h from net/route.h

Reviewed by: ae


# 0c39d38d 06-Jan-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Historically we have two fields in tcpcb to describe sender MSS: t_maxopd,
and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned
up after T/TCP removal. After all permutations over the years the result is
that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum
protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if
timestamps are in action, or is equal to t_maxopd otherwise. That's a very
rough estimate of MSS reduced by options length. Throughout the code it
was used in places, where preciseness was not important, like cwnd or
ssthresh calculations.

With this change:

- t_maxopd goes away.
- t_maxseg now stores MSS not adjusted by options.
- new function tcp_maxseg() is provided, that calculates MSS reduced by
options length. The functions gives a better estimate, since it takes
into account SACK state as well.

Reviewed by: jtl
Differential Revision: https://reviews.freebsd.org/D3593


# 281a0fd4 24-Dec-2015 Patrick Kelsey <pkelsey@FreeBSD.org>

Implementation of server-side TCP Fast Open (TFO) [RFC7413].

TFO is disabled by default in the kernel build. See the top comment
in sys/netinet/tcp_fastopen.c for implementation particulars.

Reviewed by: gnn, jch, stas
MFC after: 3 days
Sponsored by: Verisign, Inc.
Differential Revision: https://reviews.freebsd.org/D4350


# 616bc4f4 22-Dec-2015 Bjoern A. Zeeb <bz@FreeBSD.org>

If bootverbose is enabled every vnet startup and virtual interface
creation will print extra lines on the console. We are generally not
interested in this (repeated) information for each VNET. Thus only
print it for the default VNET. Virtual interfaces on the base system
will remain printing information, but e.g. each loopback in each vnet
will no longer cause a "bpf attached" line.

Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4531


# c54a41de 22-Dec-2015 Jonathan T. Looney <jtl@FreeBSD.org>

Fix a panic when launching VNETs after the commit of r292309.

Differential Revision: https://reviews.freebsd.org/D4645
Reviewed by: rrs
Reported by: kp
Tested by: kp
Sponsored by: Juniper Networks


# 55bceb1e 15-Dec-2015 Randall Stewart <rrs@FreeBSD.org>

First cut of the modularization of our TCP stack. Still
to do is to clean up the timer handling using the async-drain.
Other optimizations may be coming to go with this. Whats here
will allow differnet tcp implementations (one included).
Reviewed by: jtl, hiren, transports
Sponsored by: Netflix Inc.
Differential Revision: D4055


# 26882b42 26-Oct-2015 George V. Neville-Neil <gnn@FreeBSD.org>

Turning on IPSEC used to introduce a slight amount of performance
degradation (7%) for host host TCP connections over 10Gbps links,
even when there were no secuirty policies in place. There is no
change in performance on 1Gbps network links. Testing GENERIC vs.
GENERIC-NOIPSEC vs. GENERIC with this change shows that the new
code removes any overhead introduced by having IPSEC always in the
kernel.

Differential Revision: D3993
MFC after: 1 month
Sponsored by: Rubicon Communications (Netgate)


# 86a996e6 13-Oct-2015 Hiren Panchasara <hiren@FreeBSD.org>

There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.

It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.

To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).

There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.

I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.

The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.

Differential Revision: D3100
Submitted by: Jonathan Looney <jlooney at juniper dot net>
Reviewed by: gnn, hiren


# 794ac423 29-Sep-2015 Gleb Smirnoff <glebius@FreeBSD.org>

When processing ICMP need frag message, ignore the suggested MTU unless it
is smaller than the current one for this connection. This is behavior
specified by RFC 1191, and this is how original BSD stack behaved, but this
was unintentionally regressed in r182851.

Reported & tested by: Richard Russo <russor whatsapp.com>
Differential Revision: D3567
Sponsored by: Nginx, Inc.


# 399fbd0e 17-Sep-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Use proper byteswap macro. This isn't a functional change.


# db642c8e 16-Sep-2015 Gleb Smirnoff <glebius@FreeBSD.org>

In tcp_ctlinput() separate the (ip == NULL) block from the rest of the
function to reduce so many levels of indentation. Style the lines that
got now indentation reduced. No functional change.

Checked with: md5


# 5d06879a 13-Sep-2015 George V. Neville-Neil <gnn@FreeBSD.org>

dd DTrace probe points, translators and a corresponding script
to provide the TCPDEBUG functionality with pure DTrace.

Reviewed by: rwatson
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: D3530


# 24067db8 03-Sep-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Make tcp_mtudisc() static and void. No functional changes.

Sponsored by: Nginx, Inc.


# 079672cb 08-Aug-2015 Julien Charbon <jch@FreeBSD.org>

Fix a kernel assertion issue introduced with r286227:
Avoid too strict INP_INFO_RLOCK_ASSERT checks due to
tcp_notify() being called from in6_pcbnotify().

Reported by: Larry Rosenman <ler@lerctr.org>
Submitted by: markj, jch


# ff9b006d 02-Aug-2015 Julien Charbon <jch@FreeBSD.org>

Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:

- The existing TCP INP_INFO lock continues to protect the global inpcb list
stability during full list traversal (e.g. tcp_pcblist()).

- A new INP_LIST lock protects inpcb list actual modifications (inp allocation
and free) and inpcb global counters.

It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input())
and INP_INFO_WLOCK only in occasional operations that walk all connections.

PR: 183659
Differential Revision: https://reviews.freebsd.org/D2599
Reviewed by: jhb, adrian
Tested by: adrian, nitroboost-gmail.com
Sponsored by: Verisign, Inc.


# 4741bfcb 29-Jul-2015 Patrick Kelsey <pkelsey@FreeBSD.org>

Revert r265338, r271089 and r271123 as those changes do not handle
non-inline urgent data and introduce an mbuf exhaustion attack vector
similar to FreeBSD-SA-15:15.tcp, but not requiring VNETs.

Address the issue described in FreeBSD-SA-15:15.tcp.

Reviewed by: glebius
Approved by: so
Approved by: jmallett (mentor)
Security: FreeBSD-SA-15:15.tcp
Sponsored by: Norse Corp, Inc.


# fd90e2ed 22-May-2015 Jung-uk Kim <jkim@FreeBSD.org>

CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years for head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks


# 1ffc12bc 24-Apr-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Fix possible reference leak.

Sponsored by: Yandex LLC


# 5571f9cf 16-Apr-2015 Julien Charbon <jch@FreeBSD.org>

Fix an old and well-documented use-after-free race condition in
TCP timers:
- Add a reference from tcpcb to its inpcb
- Defer tcpcb deletion until TCP timers have finished

Differential Revision: https://reviews.freebsd.org/D2079
Submitted by: jch, Marc De La Gueronniere <mdelagueronniere@verisign.com>
Reviewed by: imp, rrs, adrian, jhb, bz
Approved by: jhb
Sponsored by: Verisign, Inc.


# d1f79a3b 10-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove kernel handling of ICMP_SOURCEQUENCH.
It hasn't been used for a very long time.
Additionally, it was deprecated by RFC 6633.


# 6df8a710 07-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.

Sponsored by: Nginx, Inc.


# 4003643d 04-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert tcp_maxmtu6() to use new routing api.


# 257480b8 04-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert netinet6/ to use new routing API.

* Remove &ifpp from ip6_output() in favor of ri->ri_nh_info
* Provide different wrappers to in6_selectsrc:
Currently it is used by 2 differenct type of customers:
- socket-based one, which all are unsure about provided
address scope and
- in-kernel ones (ND code mostly), which don't have
any sockets, options, crededentials, etc.
So, we provide two different wrappers to in6_selectsrc()
returning select source.
* Make different versions of selectroute():
Currenly selectroute() is used in two scenarios:
- SAS, via in6_selecsrc() -> in6_selectif() -> selectroute()
- output, via in6_output -> wrapper -> selectroute()
Provide different versions for each customer:
- fib6_lookup_nh_basic()-based in6_selectif() which is
capable of returning interface only, without MTU/NHOP/L2
calculations
- full-blown fib6_selectroute() with cached route/multipath/
MTU/L2
* Stop using routing table for link-local address lookups
* Add in6_ifawithifp_lla() to make for-us check faster for link-local
* Add in6_splitscope / in6_setllascope for faster embed/deembed scopes


# 9f65116c 25-Oct-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Increase nh_flags to be u16 thus reducing nhop payload to be 48 bytes
* Use NHF_ namespace for all nhop flags
* Rename nhop_data -> nhop_prepend
* Rename fib4_lookup_nh_extended -> fib4_lookup_nh_ext
* Add "flags" argument to fib4_lookup_nh_ext() to specify whether we want
returned nh_ext structure to be refcounted or not.


# f5070664 23-Oct-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add new fib4_lookup_nh_extended() which fills in nhop4_extended
structure without doinf L2 resolve. It also requires freeing
references by calling fib4_free_nh_ext().

Convert in_pcbladdr() to use it.
Convert tcp_maxmtu() to use it.


# 29c47f18 27-Sep-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Split tcp_signature_compute() into 2 pieces:
- tcp_get_sav() - SADB key lookup
- tcp_signature_do_compute() - actual computation
* Fix TCP signature case for listening socket:
do not assume EVERY connection coming to socket
with TCP_SIGNATURE set to be md5 signed regardless
of SADB key existance for particular address. This
fixes the case for routing software having _some_
BGP sessions secured by md5.
* Simplify TCP_SIGNATURE handling in tcp_input()

MFC after: 2 weeks


# 9fd573c3 22-Sep-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Improve transmit sending offload, TSO, algorithm in general.

The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

Reviewed by: adrian, rmacklem
Sponsored by: Mellanox Technologies
MFC after: 1 week


# 07e845a3 04-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Fixes for tcp_respond() comment.


# af3b2549 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 37a107a4 27-Jun-2014 Glen Barber <gjb@FreeBSD.org>

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 3da1cf1e 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# e407b67b 04-May-2014 Gleb Smirnoff <glebius@FreeBSD.org>

The FreeBSD-SA-14:08.tcp was a lesson on not doing acrobatics with
mixing on stack memory and UMA memory in one linked list.

Thus, rewrite TCP reassembly code in terms of memory usage. The
algorithm remains unchanged.

We actually do not need extra memory to build a reassembly queue.
Arriving mbufs are always packet header mbufs. So we got the length
of data as pkthdr.len. We got m_nextpkt for linkage. And we need
only one pointer to point at the tcphdr, use PH_loc for that.

In tcpcb the t_segq fields becomes mbuf pointer. The t_segqlen
field now counts not packets, but bytes in the queue. This gives
us more precision when comparing to socket buffer limits.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 2aa76dba 21-Apr-2014 Rick Macklem <rmacklem@FreeBSD.org>

Add {} braces so that the code conforms to the indentation.
Fortunately, I don't think doing the assignment of cap->tsomax
unconditionally causes any problem.

Reviewed by: glebius
MFC after: 2 weeks


# e3a7aa6f 04-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with: melifaro
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# d9fae5ab 26-Nov-2013 Andriy Gapon <avg@FreeBSD.org>

dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE

In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by: markj
MFC after: 3 weeks


# fa22ce15 25-Nov-2013 Adrian Chadd <adrian@FreeBSD.org>

Convert over the TCP probes to use mtod() rather than directly
dereferencing m->m_data.

Sponsored by: Netflix, Inc.


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# a3985bdd 17-Nov-2013 Mikolaj Golub <trociny@FreeBSD.org>

Deregister helper hooks on vnet destroy.


# 76039bc8 26-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 57f60867 25-Aug-2013 Mark Johnston <markj@FreeBSD.org>

Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by: gnn, hiren
MFC after: 1 month


# 3c914c54 02-Jun-2013 Andre Oppermann <andre@FreeBSD.org>

Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore. The upper limit is still at
IP_MAXPACKET (65536 bytes). Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.

The placement into "struct ifnet" is a bit hackish but the best place
that was found. When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by: cperciva (earlier version)
Reviewed by: cperciva
Tested by: cperciva
MFC after: 1 week (using spare struct members to preserve ABI)


# d13fc995 13-May-2013 Jim Harris <jimharris@FreeBSD.org>

Fix typo in net.inet.tcp.minmss sysctl description.

MFC after: 3 days


# f89d4c3a 06-May-2013 Andre Oppermann <andre@FreeBSD.org>

Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.


# 8fb3bbe7 17-Apr-2013 Gabor Kovesdan <gabor@FreeBSD.org>

- Corrrect mispellings of word useful

Submitted by: Christoph Mallon <christoph.mallon@gmx.de> (via private mail)


# e8b3186b 09-Apr-2013 Andre Oppermann <andre@FreeBSD.org>

Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.


# dc4ad05e 14-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Use m_get/m_gethdr instead of compat macros.

Sponsored by: Nginx, Inc.


# 6acd596e 07-Dec-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

More warnings for zones that depend on the kern.ipc.maxsockets limit.

Obtained from: WHEEL Systems


# eb1b1807 05-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# 08373e0b 26-Nov-2012 Alfred Perlstein <alfred@FreeBSD.org>

Auto size the tcbhashsize structure based on max sockets.

While here, also make the code that enforces power-of-two more
forgiving, instead of just resetting to 512, graciously round-down
to the next lower power of two.


# ec89d039 07-Nov-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Cleanup some whitspace in this file to get it out of an upcoming patch.

MFC after: 10 days


# 8f134647 22-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>


# d6d3f01e 08-Sep-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Merge the projects/pf/head branch, that was worked on for last six months,
into head. The most significant achievements in the new code:

o Fine grained locking, thus much better performance.
o Fixes to many problems in pf, that were specific to FreeBSD port.

New code doesn't have that many ifdefs and much less OpenBSDisms, thus
is more attractive to our developers.

Those interested in details, can browse through SVN log of the
projects/pf/head branch. And for reference, here is exact list of
revisions merged:

r232043, r232044, r232062, r232148, r232149, r232150, r232298, r232330,
r232332, r232340, r232386, r232390, r232391, r232605, r232655, r232656,
r232661, r232662, r232663, r232664, r232673, r232691, r233309, r233782,
r233829, r233830, r233834, r233835, r233836, r233865, r233866, r233868,
r233873, r234056, r234096, r234100, r234108, r234175, r234187, r234223,
r234271, r234272, r234282, r234307, r234309, r234382, r234384, r234456,
r234486, r234606, r234640, r234641, r234642, r234644, r234651, r235505,
r235506, r235535, r235605, r235606, r235826, r235991, r235993, r236168,
r236173, r236179, r236180, r236181, r236186, r236223, r236227, r236230,
r236252, r236254, r236298, r236299, r236300, r236301, r236397, r236398,
r236399, r236499, r236512, r236513, r236525, r236526, r236545, r236548,
r236553, r236554, r236556, r236557, r236561, r236570, r236630, r236672,
r236673, r236679, r236706, r236710, r236718, r237154, r237155, r237169,
r237314, r237363, r237364, r237368, r237369, r237376, r237440, r237442,
r237751, r237783, r237784, r237785, r237788, r237791, r238421, r238522,
r238523, r238524, r238525, r239173, r239186, r239644, r239652, r239661,
r239773, r240125, r240130, r240131, r240136, r240186, r240196, r240212.

I'd like to thank people who participated in early testing:

Tested by: Florian Smeets <flo freebsd.org>
Tested by: Chekaluk Vitaly <artemrts ukr.net>
Tested by: Ben Wilber <ben desync.com>
Tested by: Ian FREISLICH <ianf cloudseed.co.za>


# 09fe6320 19-Jun-2012 Navdeep Parhar <np@FreeBSD.org>

- Updated TOE support in the kernel.

- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)


# 7f63ba51 04-Jun-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Remove completely the m_addr_changed() hack, and support of reverse
pointer in pf_state_ket, that ware 'if 0' since beginning of
SMP-friendly pf project. In the new locking scheme we can't reference
state keys from mbuf tags, nor a key can reference another key.


# 356ab07e 28-May-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading. Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.

To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.

Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.

This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.

Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.

Individual driver updates will have to follow, as will SCTP.

Reported by: gallatin, dim, ..
Reviewed by: gallatin (glanced at?)
MFC after: 3 days
X-MFC with: r235961,235959,235958


# 45747ba5 24-May-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4 bz_ipv6_fast:

Add code to handle pre-checked TCP checksums as indicated by mbuf
flags to save the entire computation for validation if not needed.

In the IPv6 TCP output path only compute the pseudo-header checksum,
set the checksum offset in the mbuf field along the appropriate flag
as done in IPv4.

In tcp_respond() just initialize the IPv6 payload length to 0 as
ip6_output() will properly set it.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days


# ef341ee1 16-Apr-2012 Gleb Smirnoff <glebius@FreeBSD.org>

When we receive an ICMP unreach need fragmentation datagram, we take
proposed MTU value from it and update the TCP host cache. Then
tcp_mss_update() is called on the corresponding tcpcb. It finds the
just allocated entry in the TCP host cache and updates MSS on the
tcpcb. And then we do a fast retransmit of what we have in the tcp
send buffer.

This sequence gets broken if the TCP host cache is exausted. In this
case allocation fails, and later called tcp_mss_update() finds nothing
in cache. The fast retransmit is done with not reduced MSS and is
immidiately replied by remote host with new ICMP datagrams and the
cycle repeats. This ping-pong can go up to wirespeed.

To fix this:
- tcp_mss_update() gets new parameter - mtuoffer, that is like
offer, but needs to have min_protoh subtracted.
- tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify().
- tcp_mtudisc() now accepts not a useless error argument, but proposed
MTU value, that is passed to tcp_mss_update() as mtuoffer.

Reported by: az
Reported by: Andrey Zonov <andrey zonov.org>
Reviewed by: andre (previous version of patch)


# 2454a7ca 27-Mar-2012 Marko Zec <zec@FreeBSD.org>

Permit tcpdrop in VNET jails.

Submitted by: Miljenko Mikuc
MFC after: 3 days


# 81d5d46b 03-Feb-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Add multi-FIB IPv6 support to the core network stack supplementing
the original IPv4 implementation from r178888:

- Use RT_DEFAULT_FIB in the IPv4 implementation where noticed.
- Use rt*fib() KPI with explicit RT_DEFAULT_FIB where applicable in
the NFS code.
- Use the new in6_rt* KPI in TCP, gif(4), and the IPv6 network stack
where applicable.
- Split in6_rtqtimo() and in6_mtutimo() as done in IPv4 and equally
prevent multiple initializations of callouts in in6_inithead().
- Use wrapper functions where needed to preserve the current KPI to
ease MFCs. Use BURN_BRIDGES to indicate expected future cleanup.
- Fix (related) comments (both technical or style).
- Convert to rtinit() where applicable and only use custom loops where
currently not possible otherwise.
- Multicast group, most neighbor discovery address actions and faith(4)
are locked to the default FIB. Individual IPv6 addresses will only
appear in the default FIB, however redirect information and prefixes
of connected subnets are automatically propagated to all FIBs by
default (mimicking IPv4 behavior as closely as possible).

Sponsored by: Cisco Systems, Inc.


# dceced71 14-Jul-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Unbreak no-INET kernels after r223839 adding the needed #ifdef INET.

MFC after: 4 weeks


# 1c6e7fa7 07-Jul-2011 Andre Oppermann <andre@FreeBSD.org>

Remove the TCP_SORECEIVE_STREAM compile time option. The use of
soreceive_stream() for TCP still has to be enabled with the loader
tuneable net.inet.tcp.soreceive_stream.

Suggested by: trociny and others


# e6c90582 04-Jul-2011 Ermal Luçi <eri@FreeBSD.org>

pf(4) tags now store the state key but tcp_respond tries to reuse a mbuf as an optimization.
This makes pf find the wrong state and cause errors reported with state mismatches.
Clear the cached state link on the pf(4) tag to avoid the state mismatches.

Approved by: bz


# 52cd27cb 05-Jun-2011 Robert Watson <rwatson@FreeBSD.org>

Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.

Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).

Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.

Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.

Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).

Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.

Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.


# fa046d87 30-May-2011 Robert Watson <rwatson@FreeBSD.org>

Decompose the current single inpcbinfo lock into two locks:

- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:

INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).

Some notes:

- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.


# bc7d18ae 09-May-2011 Alexander Motin <mav@FreeBSD.org>

Refactor TCP ISN increment logic. Instead of firing callout at 100Hz to
keep constant ISN growth rate, do the same directly inside tcp_new_isn(),
taking into account how much time (ticks) passed since the last call.

On my test systems this decreases idle interrupt rate from 140Hz to 70Hz.


# b287c6c7 30-Apr-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Make the TCP code compile without INET. Sort #includes and add #ifdef INETs.
Add some comments at #endifs given more nestedness. To make the compiler
happy, some default initializations were added in accordance with the style
on the files.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


# 2903309a 25-Apr-2011 Attilio Rao <attilio@FreeBSD.org>

Add the possibility to verify MD5 hash of incoming TCP packets.
As long as this is a costy function, even when compiled in (along with
the option TCP_SIGNATURE), it can be disabled via the
net.inet.tcp.signature_verify_input sysctl.

Sponsored by: Sandvine Incorporated
Reviewed by: emaste, bz
MFC after: 2 weeks


# 6bccea7c 21-Feb-2011 Rebecca Cran <brucec@FreeBSD.org>

Fix typos - remove duplicate "the".

PR: bin/154928
Submitted by: Eitan Adler <lists at eitanadler.com>
MFC after: 3 days


# 79c3d51b 18-Jan-2011 Matthew D Fleming <mdf@FreeBSD.org>

Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need
to rely on the format string. For SYSCTL_PROC instances that I
noticed a discrepancy between the CTLTYPE and the format specifier,
fix the CTLTYPE.


# f88910cd 12-Jan-2011 Matthew D Fleming <mdf@FreeBSD.org>

sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.

Commit the net* piece.


# 39bc9de5 27-Dec-2010 Lawrence Stewart <lstewart@FreeBSD.org>

- Add some helper hook points to the TCP stack. The hooks allow Khelp modules to
access inbound/outbound events and associated data for established TCP
connections. The hooks only run if at least one hook function is registered
for the hook point, ensuring the impact on the stack is effectively nil when
no TCP Khelp modules are loaded. struct tcp_hhook_data is passed as contextual
data to any registered Khelp module hook functions.

- Add an OSD (Object Specific Data) pointer to struct tcpcb to allow Khelp
modules to associate per-connection data with the TCP control block.

- Bump __FreeBSD_version and add a note to UPDATING regarding to ABI changes
introduced by this commit and r216753.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz, others along the way
MFC after: 3 months


# 22968a7d 27-Dec-2010 Lawrence Stewart <lstewart@FreeBSD.org>

Fix a whitespace nit introduced in r215166.

Sponsored by: FreeBSD Foundation
Spotted by: bz
MFC after: 5 weeks
X-MFC with: r215166


# 3e288e62 22-Nov-2010 Dimitry Andric <dim@FreeBSD.org>

After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.


# 99065ae6 16-Nov-2010 Lawrence Stewart <lstewart@FreeBSD.org>

Move protocol specific implementation detail out of the core CC framework.

Sponsored by: FreeBSD Foundation
Tested by: Mikolaj Golub <to.my.trociny at gmail com>
MFC after: 11 weeks
X-MFC with: r215166


# 14f57a8b 16-Nov-2010 Lawrence Stewart <lstewart@FreeBSD.org>

cc_init() should only be run once on system boot, but with VIMAGE kernels it
runs on boot and each time a vnet jail is created. Running cc_init() multiple
times results in a panic when attempting to initialise the cc_list lock again,
and so r215166 effectively broke the use of vnet jails.

Switch to using a SYSINIT to run cc_init() on boot. CC algorithm modules loaded
on boot register in the same SI_SUB_PROTO_IFATTACHDOMAIN category as is used in
this patch, so cc_init() is run at SI_ORDER_FIRST to ensure the framework is
initialised before module registration is attempted.

Sponsored by: FreeBSD Foundation
Reported and tested by: Mikolaj Golub <to.my.trociny at gmail com>
MFC after: 11 weeks
X-MFC with: r215166


# 31c6a003 14-Nov-2010 Dimitry Andric <dim@FreeBSD.org>

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.


# dbc42409 11-Nov-2010 Lawrence Stewart <lstewart@FreeBSD.org>

This commit marks the first formal contribution of the "Five New TCP Congestion
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/

- Add a KPI and supporting infrastructure to allow modular congestion control
algorithms to be used in the net stack. Algorithms can maintain per-connection
state if required, and connections maintain their own algorithm pointer, which
allows different connections to concurrently use different algorithms. The
TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
programmatically query or change the congestion control algorithm respectively
from within an application at runtime.

- Integrate the framework with the TCP stack in as least intrusive a manner as
possible. Care was also taken to develop the framework in a way that should
allow integration with other congestion aware transport protocols (e.g. SCTP)
in the future. The hope is that we will one day be able to share a single set
of congestion control algorithm modules between all congestion aware transport
protocols.

- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
and use it to decouple the meaning of recovery from a congestion event and
recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
congestion control protocols don't generally need to recover from packet loss
and need a different way to note a congestion recovery episode within the
stack.

- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
and ensures the stack always uses the appropriate mechanisms for recovering
from packet loss during a congestion recovery episode.

- Extract the NewReno congestion control algorithm from the TCP stack and
massage it into module form. NewReno is always built into the kernel and will
remain the default algorithm for the forseeable future. Implementations of
additional different algorithms will become available in the near future.

- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
that relies on the size of "struct tcpcb" is required.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: Cisco URP, FreeBSD Foundation
Reviewed by: rpaulo
Tested by: David Hayes (and many others over the years)
MFC after: 3 months


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 0c236c4e 24-Sep-2010 Lawrence Stewart <lstewart@FreeBSD.org>

Internalise reassembly queue related functionality and variables which should
not be used outside of the reassembly queue implementation. Provide a new
function to flush all segments from a reassembly queue and call it from the
appropriate places instead of manipulating the queue directly.

Sponsored by: FreeBSD Foundation
Reviewed by: andre, gnn, rpaulo
MFC after: 2 weeks


# 1c18314d 16-Sep-2010 Andre Oppermann <andre@FreeBSD.org>

Remove the TCP inflight bandwidth limiter as announced in r211315
to give way for the pluggable congestion control framework. It is
the task of the congestion control algorithm to set the congestion
window and amount of inflight data without external interference.

In 'struct tcpcb' the variables previously used by the inflight
limiter are renamed to spares to keep the ABI intact and to have
some more space for future extensions.

In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to
preserve the ABI. It is always set to 0.

In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed
to preserve the ABI. It is always set to 0.

These unused variable in the various structures may be reused in the
future or garbage collected before the next release or at some other
point when an ABI change happens anyway for other reasons.

No MFC is planned. The inflight bandwidth limiter stays disabled by
default in the other branches but remains available.


# 98b9eb0d 27-Aug-2010 John Baldwin <jhb@FreeBSD.org>

Simplify the tcp pcblist estimate logic slightly.

MFC after: 3 days


# b7d747ec 18-Aug-2010 Andre Oppermann <andre@FreeBSD.org>

Untangle the net.inet.tcp.log_in_vain and net.inet.tcp.log_debug
sysctl's and remove any side effects.

Both sysctl's share the same backend infrastructure and due to the
way it was implemented enabling net.inet.tcp.log_in_vain would also
cause log_debug output to be generated. This was surprising and
eventually annoying to the user.

The log output backend is kept the same but a little shim is inserted
to properly separate log_in_vain and log_debug and to remove any side
effects.

PR: kern/137317
MFC after: 1 week


# 2278f992 18-Aug-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

When calculating the expected memory size for userspace, also take the
number of syncache entries into account for the surplus we add to account
for a possible increase of records in the re-entry window.

Discussed with: jhb, silby
MFC after: 1 week


# c007b96a 17-Aug-2010 John Baldwin <jhb@FreeBSD.org>

Ensure a minimum "slop" of 10 extra pcb structures when providing a
memory size estimate to userland for pcb list sysctls. The previous
behavior of a "slop" of n/8 does not work well for small values of n
(e.g. no slop at all if you have less than 8 open UDP connections).

Reviewed by: bz
MFC after: 1 week


# e4e92660 15-Aug-2010 Andre Oppermann <andre@FreeBSD.org>

Fix the interaction between 'ICMP fragmentation needed' MTU updates,
path MTU discovery and the tcp_minmss limiter for very small MTU's.

When the MTU suggested by the gateway via ICMP, or if there isn't
any the next smaller step from ip_next_mtu(), is lower than the
floor enforced by net.inet.tcp.minmss (default 216) the value is
ignored and the default MSS (512) is used instead. However the
DF flag in the IP header is still set in tcp_output() preventing
fragmentation by the gateway.

Fix this by using tcp_minmss as the MSS and clear the DF flag if
the suggested MTU is too low. This turns off path MTU dissovery
for the remainder of the session and allows fragmentation to be
done by the gateway.

Only MTU's smaller than 256 are affected. The smallest official
MTU specified is for AX.25 packet radio at 256 octets.

PR: kern/146628
Tested by: Matthew Luckie <mjl-at-luckie org nz>
MFC after: 1 week


# bee4e5af 14-Aug-2010 Andre Oppermann <andre@FreeBSD.org>

Disable TCP inflight limiter by default.

It was experimental and interferes with the normal congestion control
algorithms by instating a separate, possibly lower, ceiling for the
amount of data that is in flight to the remote host. With high speed
internet connections the inflight limit frequently has been estimated
too low due to the noisy nature of the RTT measurements.

This code gives way for the upcoming pluggable congestion control
framework. It is the task of the congestion control algorithm to
set the congestion window and amount of inflight data without external
interference.

Reviewed by: lstewart
MFC after: 1 week
Removal after: 1 month


# 480d7c6c 06-May-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r207369:
MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH


# 82cea7e6 29-Apr-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days


# 397069f2 27-Mar-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r205251:

Add pcb reference counting to the pcblist sysctl handler functions
to ensure type stability while caching the pcb pointers for the
copyout.

Reviewed by: rwatson


# 62f500d0 27-Mar-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r204838:

Destroy TCP UMA zones (empty or not) upon network stack teardown
to not leak them, otherwise making UMA/vmstat unhappy with every
stoped vnet.
We will still leak pages (especially for zones marked NOFREE).

Reshuffle cleanup order in tcp_destroy() to get rid of what we can
easily free first.

Reviewed by: rwatson


# d0e157f6 17-Mar-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

Add pcb reference counting to the pcblist sysctl handler functions
to ensure type stability while caching the pcb pointers for the
copyout.

Reviewed by: rwatson
MFC after: 7 days


# 9bcd427b 14-Mar-2010 Robert Watson <rwatson@FreeBSD.org>

Abstract out initialization of most aspects of struct inpcbinfo from
their calling contexts in {IP divert, raw IP sockets, TCP, UDP} and
create new helper functions: in_pcbinfo_init() and in_pcbinfo_destroy()
to do this work in a central spot. As inpcbinfo becomes more complex
due to ongoing work to add connection groups, this will reduce code
duplication.

MFC after: 1 month
Reviewed by: bz
Sponsored by: Juniper Networks


# 376aadf8 07-Mar-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

Destroy TCP UMA zones (empty or not) upon network stack teardown
to not leak them, otherwise making UMA/vmstat unhappy with every stoped vnet.
We will still leak pages (especially for zones marked NOFREE).

Reshuffle cleanup order in tcp_destroy() to get rid of what we can
easily free first.

Sponsored by: ISPsystem
Reviewed by: rwatson
MFC after: 5 days


# 2bf3ce08 07-Mar-2010 Robert Watson <rwatson@FreeBSD.org>

Add comment in tcp_discardcb() talking about how we don't, but should,
address TCP races relating to not calling tcp_drain() on stopped callouts.

Discussed with: bz


# b8614722 15-Sep-2009 Mike Silbersack <silby@FreeBSD.org>

Add the ability to see TCP timers via netstat -x. This can be a useful
feature when you have a seemingly stuck socket and want to figure
out why it has not been closed yet.

No plans to MFC this, as it changes the netstat sysctl ABI.

Reviewed by: andre, rwatson, Eric Van Gyzen


# 11c99a6d 15-Sep-2009 Andre Oppermann <andre@FreeBSD.org>

-Put the optimized soreceive_stream() under a compile time option called
TCP_SORECEIVE_STREAM for the time being.

Requested by: brooks

Once compiled in make it easily switchable for testers by using a tuneable
net.inet.tcp.soreceive_stream
and a corresponding read-only sysctl to report the current state.

Suggested by: rwatson

MFC after: 2 days


# 530c0060 01-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# a08362ce 21-Jul-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

sysctl_msec_to_ticks is used with both virtualized and
non-vrtiualized sysctls so we cannot used one common function.

Add a macro to convert the arg1 in the virtualized case to
vnet.h to not expose the maths to all over the code.

Add a wrapper for the single virtualized call, properly handling
arg1 and call the default implementation from there.

Convert the two over places to use the new macro.

Reviewed by: rwatson
Approved by: re (kib)


# 5ee847d3 19-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Reimplement and/or implement vnet list locking by replacing a mostly
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use). Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.

Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions. Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.

Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.

Update various consumers of these KPIs based on whether they may sleep
or not.

Reviewed by: bz
Approved by: re (kib)


# 1e77c105 16-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used. Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with: bz, julian
Reviewed by: bz
Approved by: re (kensmith, kib)


# eddfbb76 14-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# ebd8672c 17-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Add explicit includes for jail.h to the files that need them and
remove the "hidden" one from vimage.h.


# 9ed47d01 15-Jun-2009 Jamie Gritton <jamie@FreeBSD.org>

Get vnets from creds instead of threads where they're available, and from
passed threads instead of curthread.

Reviewed by: zec, julian
Approved by: bz (mentor)


# bc29160d 08-Jun-2009 Marko Zec <zec@FreeBSD.org>

Introduce an infrastructure for dismantling vnet instances.

Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 6d453b10 23-May-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

For UDP with introducing the UDP control block, the uma zone had to
be named "udp_inpcb" to avoid a naming conflict with tcp[1].
For consistency rename the uma zone for TCP from "inpcb" to "tcp_inpcb".

Found by: rwatson [1]
Discussed with: rwatson


# f6dfe47a 30-Apr-2009 Marko Zec <zec@FreeBSD.org>

Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:

options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

INIT_VNET_NET(ifp->if_vnet); becomes

struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by: bz, rwatson
Approved by: julian (mentor)


# 093f25f8 26-Apr-2009 Marko Zec <zec@FreeBSD.org>

In preparation for turning on options VIMAGE in next commits,
rearrange / replace / adjust several INIT_VNET_* initializer
macros, all of which currently resolve to whitespace.

Reviewed by: bz (an older version of the patch)
Approved by: julian (mentor)


# 78b50714 11-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel. This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after: 3 days


# 1ed81b73 06-Apr-2009 Marko Zec <zec@FreeBSD.org>

First pass at separating per-vnet initializer functions
from existing functions for initializing global state.

At this stage, the new per-vnet initializer functions are
directly called from the existing global initialization code,
which should in most cases result in compiler inlining those
new functions, hence yielding a near-zero functional change.

Modify the existing initializer functions which are invoked via
protosw, like ip_init() et. al., to allow them to be invoked
multiple times, i.e. per each vnet. Global state, if any,
is initialized only if such functions are called within the
context of vnet0, which will be determined via the
IS_DEFAULT_VNET(curvnet) check (currently always true).

While here, V_irtualize a few remaining global UMA zones
used by net/netinet/netipsec networking code. While it is
not yet clear to me or anybody else whether this is the right
thing to do, at this stage this makes the code more readable,
and makes it easier to track uncollected UMA-zone-backed
objects on vnet removal. In the long run, it's quite possible
that some form of shared use of UMA zone pools among multiple
vnets should be considered.

Bump __FreeBSD_version due to changes in layout of structs
vnet_ipfw, vnet_inet and vnet_net.

Approved by: julian (mentor)


# 34f27ade 21-Mar-2009 Juli Mallett <jmallett@FreeBSD.org>

Remove local in6_addr variables for local and foreign addresses in sysctl_drop,
they were passed uninitialized to in6_pcblookup_hash. Instead, do as is done
for IPv4 and use the addresses within the sockaddr structure, which are
correctly populated.

This fixes tcpdrop(8) for IPv6 address pairs.

Reviewed by: bz


# ad71fe3c 15-Mar-2009 Robert Watson <rwatson@FreeBSD.org>

Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted. Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

INP_TIMEWAIT
INP_SOCKREF
INP_ONESBCAST
INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after: 1 week (or after dependencies are MFC'd)
Reviewed by: bz


# d685b6ee 13-Feb-2009 Luigi Rizzo <luigi@FreeBSD.org>

Use uint32_t instead of n_long and n_time, and uint16_t instead of n_short.
Add a note next to fields in network format.

The n_* types are not enough for compiler checks on endianness, and their
use often requires an otherwise unnecessary #include <netinet/in_systm.h>

The typedef in in_systm.h are still there.


# 97aa4a51 08-Feb-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Try to remove/assimilate as much of formerly IPv4/6 specific
(duplicate) code in sys/netipsec/ipsec.c and fold it into
common, INET/6 independent functions.

The file local functions ipsec4_setspidx_inpcb() and
ipsec6_setspidx_inpcb() were 1:1 identical after the change
in r186528. Rename to ipsec_setspidx_inpcb() and remove the
duplicate.

Public functions ipsec[46]_get_policy() were 1:1 identical.
Remove one copy and merge in the factored out code from
ipsec_get_policy() into the other. The public function left
is now called ipsec_get_policy() and callers were adapted.

Public functions ipsec[46]_set_policy() were 1:1 identical.
Rename file local ipsec_set_policy() function to
ipsec_set_policy_internal().
Remove one copy of the public functions, rename the other
to ipsec_set_policy() and adapt callers.

Public functions ipsec[46]_hdrsiz() were logically identical
(ignoring one questionable assert in the v6 version).
Rename the file local ipsec_hdrsiz() to ipsec_hdrsiz_internal(),
the public function to ipsec_hdrsiz(), remove the duplicate
copy and adapt the callers.
The v6 version had been unused anyway. Cleanup comments.

Public functions ipsec[46]_in_reject() were logically identical
apart from statistics. Move the common code into a file local
ipsec46_in_reject() leaving vimage+statistics in small AF specific
wrapper functions. Note: unfortunately we already have a public
ipsec_in_reject().

Reviewed by: sam
Discussed with: rwatson (renaming to *_internal)
MFC after: 26 days
X-MFC: keep wrapper functions for public symbols?


# 24cb0f22 14-Jan-2009 Lawrence Stewart <lstewart@FreeBSD.org>

Add TCP Appropriate Byte Counting (RFC 3465) support to kernel.

The new behaviour is on by default, and can be disabled by setting the
net.inet.tcp.rfc3465 sysctl to 0 to obtain previous behaviour.

The patch changes struct tcpcb in sys/netinet/tcp_var.h which breaks
the ABI. Bump __FreeBSD_version to 800061 accordingly. User space tools
that rely on the size of struct tcpcb (e.g. sockstat) need to be recompiled.

Reviewed by: rpaulo, gnn
Approved by: gnn, kmacy (mentors)
Sponsored by: FreeBSD Foundation


# dcdb4371 16-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by: rwatson during review [1]
Discussed with: rwatson
Reviewed by: rwatson
MFC after: 4 weeks


# fc384fa5 15-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().

Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.

Discussed with: rwatson
Reviewed by: rwatson (version before review requested changes)
MFC after: 4 weeks (set the timer and see then)


# 6e6b3f7c 14-Dec-2008 Qing Li <qingli@FreeBSD.org>

This main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,

The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
me maintaining that branch before the svn conversion


# bccd4139 13-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

De-virtualize the MD5 context for TCP initial seq number generation
and make it a function local variable like we do almost everywhere
inside the kernel.

Discussed with: rwatson, silby
MFC after: 4 weeks


# 0750c2ed 11-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Use the correct INIT_VNET_INET() as the virtualized variable here
are in vinet.h not in vinet6.h

Sponsored by: The FreeBSD Foundation


# 385195c0 10-Dec-2008 Marko Zec <zec@FreeBSD.org>

Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out. Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs. This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps. PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw. Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 4b79449e 02-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 3b6fe5fc 28-Nov-2008 Dag-Erling Smørgrav <des@FreeBSD.org>

missing V_


# 97021c24 26-Nov-2008 Marko Zec <zec@FreeBSD.org>

Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 44e33a07 19-Nov-2008 Marko Zec <zec@FreeBSD.org>

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 91d6cfa6 06-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

Move the TSO logic back to tcp_mss() and out of tcp_mss_update().
We tried to avoid that initially but if were are called from
tcp_output() with EMSGSIZE, we cleared the TSO flag on the tcpcb
there, called into tcp_mtudisc() and tcp_mss_update() which
then would reenable TSO on the tcpcb based on TSO capabilities
of the interface as learnt in tcp_maxmtu/6().
So if TSO was enabled on the (possibly new) outgoing interface
it was turned back on, which lead to an endless loop between
tcp_output() and tcp_mtudisc() until we overflew the stack.

Reported by: kmacy
MFC after: 2 months (along with r182851)


# 4b3f4d38 05-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Adopt the comment for tcp_maxmtu(); we are returning a number
not a pointer. While here update the rest of the comment to
better match what we have these days.

MFC after: 2 months


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# f08ef6c5 17-Oct-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Add cr_canseeinpcb() doing checks using the cached socket
credentials from inp_cred which is also available after the
socket is gone.
Switch cr_canseesocket consumers to cr_canseeinpcb.
This removes an extra acquisition of the socket lock.

Reviewed by: rwatson
MFC after: 3 months (set timer; decide then)


# 86d02c5c 04-Oct-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Cache so_cred as inp_cred in the inpcb.
This means that inp_cred is always there, even after the socket
has gone away. It also means that it is constant for the lifetime
of the inp.
Both facts lead to simpler code and possibly less locking.

Suggested by: rwatson
Reviewed by: rwatson
MFC after: 6 weeks
X-MFC Note: use a inp_pspare for inp_cred


# 8b615593 02-Oct-2008 Marko Zec <zec@FreeBSD.org>

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 3418daf2 13-Sep-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Implement IPv6 support for TCP MD5 Signature Option (RFC 2385)
the same way it has been implemented for IPv4.

Reviewed by: bms (skimmed)
Tested by: Nick Hilliard (nick netability.ie) (with more changes)
MFC after: 2 months


# 00db174b 07-Sep-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

To my reading there are no real consumers of ip6_plen (IPv6
Payload Length) as set in tcpip_fillheaders().
ip6_output() will calculate it based of the length from the
mbuf packet header itself.
So initialize the value in tcpip_fillheaders() in correct
(network) byte order.

With the above change, to my reading, all places calling tcp_trace()
pass in the ip6 header via ipgen as serialized in the mbuf and with
ip6_plen in network byte order.
Thus convert the IPv6 payload length to host byte order before printing.

MFC after: 2 months


# 3cee92e0 07-Sep-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Split tcp_mss() in tcp_mss() and tcp_mss_update() where the former
calls the latter.

Merge tcp_mss_update() with code from tcp_mtudisc() basically
doing the same thing.

This gives us one central place where we calcuate and check mss values
to update t_maxopd (maximum mss + options length) instead of two slightly
different but almost equal implementations to maintain.

PR: kern/118455
Reviewed by: silby (back in March)
MFC after: 2 months


# ebe54269 07-Sep-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

V_irtualize SVN r182846 tcp_mssdflt/tcp_v6mssdflt procedure based
sysctl implementations for VIMAGE the same way we did elsewhere:
update the implementation but leave the globals and the SYSCTL
statement untouched.


# 4cdf3bed 07-Sep-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Convert SYSCTL_INTs for tcp_mssdflt and tcp_v6mssdflt to
SYSCTL_PROCs and check that the default mss for neither v4 nor
v6 goes below the minimum MSS constant (216).

This prevents people from shooting themselves in the foot.

PR: kern/118455 (remotely related)
Reviewed by: silby (as part of a larger patch in March)
MFC after: 2 months


# 5ed3800e 19-Aug-2008 Julian Elischer <julian@FreeBSD.org>

Fix some of the formatting fixes.. It's amazing how some thing stand out
in a commit message.


# ac957cd2 19-Aug-2008 Julian Elischer <julian@FreeBSD.org>

A bunch of formatting fixes brough to light by, or created by the Vimage commit
a few days ago.


# 603724d3 17-Aug-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# 53640b0e 02-Jun-2008 Robert Watson <rwatson@FreeBSD.org>

When allocating temporary storage to hold a TCP/IP packet header
template, use an M_TEMP malloc(9) allocation rather than an mbuf
with mtod(9) and dtom(9). This eliminates the last use of
dtom(9) in TCP.

MFC after: 3 weeks


# c28cb4d8 29-May-2008 Robert Watson <rwatson@FreeBSD.org>

Read lock rather than write lock TCP inpcbs in monitoring sysctls. In
some cases, add explicit inpcb locking rather than relying on the global
lock, as we dereference inp_socket, but also allowing us to drop the
global lock more quickly.

MFC after: 1 week


# 8b07e49a 09-May-2008 Julian Elischer <julian@FreeBSD.org>

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# 8501a69c 17-Apr-2008 Robert Watson <rwatson@FreeBSD.org>

Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after: 3 months
Tested by: kris (superset of committered patch)


# bc65987a 18-Dec-2007 Kip Macy <kmacy@FreeBSD.org>

Incorporate TCP offload hooks in to core TCP code.
- Rename output routines tcp_gen_* -> tcp_output_*.
- Rename notification routines that turn in to no-ops in the absence of TOE
from tcp_gen_* -> tcp_offload_*.
- Fix some minor comment nits.
- Add a /* FALLTHROUGH */

Reviewed by: Sam Leffler, Robert Watson, and Mike Silbersack


# abebe6db 28-Nov-2007 Bjoern A. Zeeb <bz@FreeBSD.org>

Correctly get the authentication key for TCP-MD5 from the SA.

Submitted by: Nick Hilliard on net@
MFC after: 8 weeks


# 2b19cb1b 24-Nov-2007 Robert Watson <rwatson@FreeBSD.org>

More carefully handle various cases in sysctl_drop(), such as unlocking
the inpcb when there's an inpcb without associated timewait state, and
not unlocking when the inpcb has been freed. This avoids a kernel panic
when tcpdrop(8) is run on a socket in the TIMEWAIT state.

MFC after: 3 days
Reported by: Rako <rako29 at gmail dot com>


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 4b421e2d 07-Oct-2007 Mike Silbersack <silby@FreeBSD.org>

Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by: re (kensmith)


# 0fb651b1 05-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Disable TCP syncache debug logging by default. While useful in debugging
problems with the syncache, it produces a lot of console noise and has led
to quite a few false positive bug reports. It can be selectively
re-enabled when debugging specific problems by frobbing the same sysctl.

Discussed with: silby
Approved by: re (gnn)


# e2f2059f 23-Sep-2007 Mike Silbersack <silby@FreeBSD.org>

Two changes:

- Reintegrate the ANSI C function declaration change
from tcp_timer.c rev 1.92

- Reorganize the tcpcb structure so that it has a single
pointer to the "tcp_timer" structure which contains all
of the tcp timer callouts. This change means that when
the single tcp timer change is reintegrated, tcpcb will
not change in size, and therefore the ABI between
netstat and the kernel will not change.

Neither of these changes should have any functional
impact.

Reviewed by: bmah, rrs
Approved by: re (bmah)


# 85d94372 07-Sep-2007 Robert Watson <rwatson@FreeBSD.org>

Back out tcp_timer.c:1.93 and associated changes that reimplemented the many
TCP timers as a single timer, but retain the API changes necessary to
reintroduce this change. This will back out the source of at least two
reported problems: lock leaks in certain timer edge cases, and TCP timers
continuing to fire after a connection has closed (a bug previously fixed and
then reintroduced with the timer rewrite).

In a follow-up commit, some minor restylings and comment changes performed
after the TCP timer rewrite will be reapplied, and a further change to allow
the TCP timer rewrite to be added back without disturbing the ABI. The new
design is believed to be a good thing, but the outstanding issues are
leading to significant stability/correctness problems that are holding
up 7.0.

This patch was generated by silby, but is being committed by proxy due to
poor network connectivity for silby this week.

Approved by: re (kensmith)
Submitted by: silby
Tested by: rwatson, kris
Problems reported by: peter, kris, others


# 8cb5ba02 15-Aug-2007 Qing Li <qingli@FreeBSD.org>

Use the sequence number comparison macro to compare
projected_offset against isn_offset to account for
wrap around.

Reviewed by: gnn, kmacy, silby
Submitted by: yusheng.huang@bluecoat.com
Approved by: re
MFC: 3 days


# c4a184bd 31-Jul-2007 Peter Wemm <peter@FreeBSD.org>

Change TCPTV_MIN to be independent of HZ. While it was documented to
be in ticks "for algorithm stability" when originally committed, it turns
out that it has a significant impact in timing out connections. When we
changed HZ from 100 to 1000, this had a big effect on reducing the time
before dropping connections.

To demonstrate, boot with kern.hz=100. ssh to a box on local ethernet
and establish a reliable round-trip-time (ie: type a few commands).
Then unplug the ethernet and press a key. Time how long it takes to
drop the connection.

The old behavior (with hz=100) caused the connection to typically drop
between 90 and 110 seconds of getting no response.

Now boot with kern.hz=1000 (default). The same test causes the ssh session
to drop after just 9-10 seconds. This is a big deal on a wifi connection.

With kern.hz=1000, change sysctl net.inet.tcp.rexmit_min from 3 to 30.
Note how it behaves the same as when HZ was 100. Also, note that when
booting with hz=100, net.inet.tcp.rexmit_min *used* to be 30.

This commit changes TCPTV_MIN to be scaled with hz. rexmit_min should
always be about 30. If you set hz to Really Slow(TM), there is a safety
feature to prevent a value of 0 being used.

This may be revised in the future, but for the time being, it restores the
old, pre-hz=1000 behavior, which is significantly less annoying.

As a workaround, to avoid rebooting or rebuilding a kernel, you can run
"sysctl net.inet.tcp.rexmit_min=30" and add "net.inet.tcp.rexmit_min=30"
to /etc/sysctl.conf. This is safe to run from 6.0 onwards.

Approved by: re (rwatson)
Reviewed by: andre, silby


# 773673c1 27-Jul-2007 Andre Oppermann <andre@FreeBSD.org>

Provide a sysctl to toggle reporting of TCP debug logging:

sys.net.inet.tcp.log_debug = 1

It defaults to enabled for the moment and is to be turned off for
the next release like other diagnostics from development branches.

It is important to note that sysctl sys.net.inet.tcp.log_in_vain
uses the same logging function as log_debug. Enabling of the former
also causes the latter to engage, but not vice versa.

Use consistent terminology in tcp log messages:

"ignored" means a segment contains invalid flags/information and
is dropped without changing state or issuing a reply.

"rejected" means a segments contains invalid flags/information but
is causing a reply (usually RST) and may cause a state change.

Approved by: re (rwatson)


# c6b28997 28-Jul-2007 Robert Watson <rwatson@FreeBSD.org>

Replace references to NET_CALLOUT_MPSAFE with CALLOUT_MPSAFE, and remove
definition of NET_CALLOUT_MPSAFE, which is no longer required now that
debug.mpsafenet has been removed.

The once over: bz
Approved by: re (kensmith)


# c325962b 26-Jul-2007 Mike Silbersack <silby@FreeBSD.org>

Export the contents of the syncache to netstat.

Approved by: re (kensmith)
MFC after: 2 weeks


# 477d44c4 05-Jul-2007 Peter Wemm <peter@FreeBSD.org>

Fix a second warning, introduced by my last "fix". I committed the wrong
diff from the wrong machine.

Pointy hat to: peter
Approved by: re (rwatson - blanket, several days ago)


# 9fb5d4c0 04-Jul-2007 Peter Wemm <peter@FreeBSD.org>

Fix cast-qualifiers warning when INET6 is not present

Approved by: re (rwatson)


# b2630c29 02-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC
option is now deprecated, as well as the KAME IPsec code.
What was FAST_IPSEC is now IPSEC.

Approved by: re
Sponsored by: Secure Computing


# 2cb64cb2 01-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by: bz
Approved by: re
Supported by: Secure Computing


# 32f9753c 11-Jun-2007 Robert Watson <rwatson@FreeBSD.org>

Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.

Reviewed by: csjp
Obtained from: TrustedBSD Project


# b312d4b0 27-May-2007 Robert Watson <rwatson@FreeBSD.org>

Don't assign sp to the value of s when we're about to assign it instead to
s + strlen(s).

Found with: Coverity Prevent(tm)
CID: 2243


# ec05a173 23-May-2007 Andre Oppermann <andre@FreeBSD.org>

In tcp_log_addrs():
o add the hex output of the th_flags field to the example log
line in comments
o simplify the log line length calculation and make it less
evil
o correct the test for the length panic; the line isn't on
the stack but malloc'ed


# df541e5f 18-May-2007 Andre Oppermann <andre@FreeBSD.org>

Add tcp_log_addrs() function to generate and standardized TCP log line
for use thoughout the tcp subsystem.

It is IPv4 and IPv6 aware creates a line in the following format:

"TCP: [1.2.3.4]:50332 to [1.2.3.4]:80 tcpflags <RST>"

A "\n" is not included at the end. The caller is supposed to add
further information after the standard tcp log header.

The function returns a NUL terminated string which the caller has
to free(s, M_TCPLOG) after use. All memory allocation is done
with M_NOWAIT and the return value may be NULL in memory shortage
situations.

Either struct in_conninfo || (struct tcphdr && (struct ip || struct
ip6_hdr) have to be supplied.

Due to ip[6].h header inclusion limitations and ordering issues the
struct ip and struct ip6_hdr parameters have to be casted and passed
as void * pointers.

tcp_log_addrs(struct in_conninfo *inc, struct tcphdr *th, void *ip4hdr,
void *ip6hdr)

Usage example:

struct ip *ip;
char *tcplog;

if (tcplog = tcp_log_addrs(NULL, th, (void *)ip, NULL)) {
log(LOG_DEBUG, "%s; %s: Connection attempt to closed port\n",
tcplog, __func__);
free(s, M_TCPLOG);
}


# 2104448f 16-May-2007 Andre Oppermann <andre@FreeBSD.org>

Move TIME_WAIT related functions and timer handling from files
other than repo copied tcp_subr.c into tcp_timewait.c#1.284:

tcp_input.c#1.350 tcp_timewait() -> tcp_twcheck()

tcp_timer.c#1.92 tcp_timer_2msl_reset() -> tcp_tw_2msl_reset()
tcp_timer.c#1.92 tcp_timer_2msl_stop() -> tcp_tw_2msl_stop()
tcp_timer.c#1.92 tcp_timer_2msl_tw() -> tcp_tw_2msl_scan()

This is a mechanical move with appropriate renames and making
them static if used only locally.

The tcp_tw_2msl_scan() cleanup function is still run from the
tcp_slowtimo() in tcp_timer.c.


# ec9c7553 13-May-2007 Andre Oppermann <andre@FreeBSD.org>

Complete the (mechanical) move of the TCP reassembly and timewait
functions from their origininal place to their own files.

TCP Reassembly from tcp_input.c -> tcp_reass.c
TCP Timewait from tcp_subr.c -> tcp_timewait.c


# 0489b64c 11-May-2007 Andre Oppermann <andre@FreeBSD.org>

Make the TCP timer callout obtain Giant if the network stack is marked
as non-mpsafe.

This change is to be removed when all protocols are mp-safe.


# 504abdc6 11-May-2007 Andre Oppermann <andre@FreeBSD.org>

Add the timestamp offset to struct tcptw so we can generate proper
ACKs in TIME_WAIT state that don't get dropped by the PAWS check
on the receiver.


# f2565d68 10-May-2007 Robert Watson <rwatson@FreeBSD.org>

Move universally to ANSI C function declarations, with relatively
consistent style(9)-ish layout.


# 434a0d24 07-May-2007 Robert Watson <rwatson@FreeBSD.org>

When setting up timewait state for a TCP connection, don't hold the
socket lock over a crhold() of so_cred: so_cred is constant after
socket creation, so doesn't require locking to read.


# 3529149e 06-May-2007 Andre Oppermann <andre@FreeBSD.org>

Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of
a decdicated sack_enable int for this bool. Change all users accordingly.


# 712fc218 30-Apr-2007 Robert Watson <rwatson@FreeBSD.org>

Rename some fields of struct inpcbinfo to have the ipi_ prefix,
consistent with the naming of other structure field members, and
reducing improper grep matches. Clean up and comment structure
fields in structure definition.


# bbf4e1cb 18-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Make tcp_twrespond() use tcp_addoptions() instead of a home grown version.


# b8152ba7 11-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Change the TCP timer system from using the callout system five times
directly to a merged model where only one callout, the next to fire,
is registered.

Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.

The single new callout is a mutex callout on inpcb simplifying the
locking a bit.

tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.

Reviewed by: rwatson (earlier version)


# 5dd9dfef 04-Apr-2007 Andre Oppermann <andre@FreeBSD.org>

Retire unused TCP_SACK_DEBUG.


# ad3f9ab3 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

ANSIfy function declarations and remove register keywords for variables.
Consistently apply style to all function declarations.


# e406f5a1 21-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Remove tcp_minmssoverload DoS detection logic. The problem it tried to
protect us from wasn't really there and it only bloats the code. Should
the problem surface in the future we can simply resurrect it from cvs
history.


# 6489fe65 19-Mar-2007 Andre Oppermann <andre@FreeBSD.org>

Match up SYSCTL declaration style.


# 8d0d6d11 16-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Remove unused and #if 0'd net.inet.tcp.tcp_rttdflt sysctl.


# 7c72af87 26-Feb-2007 Mohan Srinivasan <mohans@FreeBSD.org>

Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigate
potential issues where the peer does not close, potentially leaving
thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl
fast_finwait2_recycle, which is disabled by default.

Reviewed by: gnn, silby.


# 54e3607d 30-Dec-2006 John Baldwin <jhb@FreeBSD.org>

Whitespace fix and remove an extra cast.


# acd3428b 06-Nov-2006 Robert Watson <rwatson@FreeBSD.org>

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# acc03ac6 29-Sep-2006 Maxim Konovalov <maxim@FreeBSD.org>

o Convert w/spaces to tabs in the previous commit.


# d4bdcb16 29-Sep-2006 Mike Silbersack <silby@FreeBSD.org>

Rather than autoscaling the number of TIME_WAIT sockets to maxsockets / 5,
scale it to min(ephemeral port range / 2, maxsockets / 5) so that people
with large gobs of memory and/or large maxsockets settings will not
exhaust their entire ephemeral port range with sockets in the TIME_WAIT
state during periods of heavy load.

Those who wish to tweak the size of the TIME_WAIT zone can still do so with
net.inet.tcp.maxtcptw.

Reviewed by: glebius, ru


# 3e630ef9 08-Sep-2006 Gleb Smirnoff <glebius@FreeBSD.org>

Add a sysctl net.inet.tcp.nolocaltimewait that allows to suppress
creating a compress TIME WAIT states, if both connection endpoints
are local. Default is off.


# 751dea29 07-Sep-2006 Ruslan Ermilov <ru@FreeBSD.org>

Back when we had T/TCP support, we used to apply different
timeouts for TCP and T/TCP connections in the TIME_WAIT
state, and we had two separate timed wait queues for them.
Now that is has gone, the timeout is always 2*MSL again,
and there is no reason to keep two queues (the first was
unused anyway!).

Also, reimplement the remaining queue using a TAILQ (it
was technically impossible before, with two queues).


# 233dcce1 06-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

First step of TSO (TCP segmentation offload) support in our network stack.

o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
o add CSUM_TSO flag to mbuf pkthdr csum_flags field
o add tso_segsz field to mbuf pkthdr
o enhance ip_output() packet length check to allow for large TSO packets
o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
o adjust all callers of tcp_maxmtu[46]() accordingly

Discussed on: -current, -net
Sponsored by: TCP/IP Optimization Fundraise 2005


# 2c857a9b 06-Sep-2006 Gleb Smirnoff <glebius@FreeBSD.org>

o Backout rev. 1.125 of in_pcb.c. It appeared to behave extremely
bad under high load. For example with 40k sockets and 25k tcptw
entries, connect() syscall can run for seconds. Debugging showed
that it iterates the cycle millions times and purges thousands of
tcptw entries at a time.
Besides practical unusability this change is architecturally
wrong. First, in_pcblookup_local() is used in connect() and bind()
syscalls. No stale entries purging shouldn't be done here. Second,
it is a layering violation.
o Return back the tcptw purging cycle to tcp_timer_2msl_tw(),
that was removed in rev. 1.78 by rwatson. The commit log of this
revision tells nothing about the reason cycle was removed. Now
we need this cycle, since major cleaner of stale tcptw structures
is removed.
o Disable probably necessary, but now unused
tcp_twrecycleable() function.

Reviewed by: ru


# c3e07bf8 05-Sep-2006 Gleb Smirnoff <glebius@FreeBSD.org>

Finally fix rev. 1.256

Pointy hat to: glebius


# 23ebab41 04-Sep-2006 Gleb Smirnoff <glebius@FreeBSD.org>

Remove extra parenthesis in last commit.

Nitpicked by: ru


# 1f1f90c3 04-Sep-2006 Gleb Smirnoff <glebius@FreeBSD.org>

- Make net.inet.tcp.maxtcptw modifiable at run time.
- If net.inet.tcp.maxtcptw was ever set explicitly, do
not change it if kern.ipc.maxsockets is changed.


# 2374501c 26-Aug-2006 Mohan Srinivasan <mohans@FreeBSD.org>

Fix for a bug that causes the computation of "len" in tcp_output() to
get messed up, resulting in an inconsistency between the TCP state
and so_snd.


# 464469c7 11-Aug-2006 Mohan Srinivasan <mohans@FreeBSD.org>

Fixes an edge case bug in timewait handling where ticks rolling over causing
the timewait expiry to be exactly 0 corrupts the timewait queues (and that entry).
Reviewed by: silby


# e8504752 02-Aug-2006 Robert Watson <rwatson@FreeBSD.org>

Move soisdisconnected() in tcp_discardcb() to one of its calling contexts,
tcp_twstart(), but not to the other, tcp_detach(), as the socket is
already being torn down and therefore there are no listeners. This avoids
a panic if kqueue state is registered on the socket at close(), and
eliminates to XXX comments. There is one case remaining in which
tcp_discardcb() reaches up to the socket layer as part of the TCP host
cache, which would be good to avoid.

Reported by: Goran Gajic <ggajic at afrodita dot rcub dot bg dot ac dot yu>


# a152f8a3 21-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

Change semantics of socket close and detach. Add a new protocol switch
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket. pru_abort is now a
notification of close also, and no longer detaches. pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket. This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach of the protocol retained a reference.

This faciliates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree(). With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.

Reviewed by: gnn


# d915b280 18-Jul-2006 Stephan Uphoff <ups@FreeBSD.org>

Fix race conditions on enumerating pcb lists by moving the initialization
( and where appropriate the destruction) of the pcb mutex to the init/finit
functions of the pcb zones.
This allows locking of the pcb entries and race condition free comparison
of the generation count.
Rearrange locking a bit to avoid extra locking operation to update the generation
count in in_pcballoc(). (in_pcballoc now returns the pcb locked)

I am planning to convert pcb list handling from a type safe to a reference count
model soon. ( As this allows really freeing the PCBs)

Reviewed by: rwatson@, mohans@
MFC after: 1 week


# 10702a28 25-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Abstract inpcb drop logic, previously just setting of INP_DROPPED in TCP,
into in_pcbdrop(). Expand logic to detach the inpcb from its bound
address/port so that dropping a TCP connection releases the inpcb resource
reservation, which since the introduction of socket/pcb reference count
updates, has been persisting until the socket closed rather than being
released implicitly due to prior freeing of the inpcb on TCP drop.

MFC after: 3 months


# 9106a6d6 22-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Replace isn_mtx direct use with ISN_*() lock macros so that locking
details/strategy can be changed without touching every use.

MFC after: 3 months


# 4c0e8f41 22-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Introduce a new TCP mutex, isn_mtx, which protects the initial sequence
number state, rather than re-using pcbinfo. This introduces some
additional mutex operations during isn query, but avoids hitting the TCP
pcbinfo lock out of yet another frequently firing TCP timer.

MFC after: 3 months


# 4f590175 21-Apr-2006 Paul Saab <ps@FreeBSD.org>

Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.


# a73b6567 04-Apr-2006 Gleb Smirnoff <glebius@FreeBSD.org>

Add a tunable net.inet.tcp.maxtcptw, that allows to set a limit
on tcptw zone independently from setting a limit on socket zone.


# ae0e7143 03-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being
NULL. We currently do allow this to happen, but may want to remove that
possibility in the future. This case can occur when a socket is left
open after TCP wraps up, and the timewait state is recycled. This will
be cleaned up in the future.

Found by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months


# cb895fb9 03-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

In TCP notify routines, check inpcb for INP_TIMEWAIT and INP_DROPPED.
The INP_DROPPED check replaces the current NULL checks; the INP_TIMEWAIT
checks appear to have always been required, but not been there, which
is/was a bug. This avoids unconditionally casting of in_ppcb to a tcpcb,
when it may be a twtcb, which may have resulted in obscure ICMP-related
panics in earlier releases.

MFC after: 3 months


# afa39e25 03-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Change inp_ppcb from caddr_t to void *, fix/remove associated related
casts.

Consistently use intotw() to cast inp_ppcb pointers to struct tcptw *
pointers.

Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb *
pointers.

Don't assign tp to the results to intotcpcb() during variable declation
at the top of functions, as that is before the asserts relating to
locking have been performed. Do this later in the function after
appropriate assertions have run to allow that operation to be conisdered
safe.

MFC after: 3 months


# 43f56a32 02-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Style tweaks: convert to ANSI from K&R function prototypes.

MFC after: 3 months


# 2fc5ae87 02-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Update comment on tcp_close() for new world order.

MFC after: 3 months


# fa38deac 03-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Fix up locking surrounding tcp_drop sysctl: in the new world order, we
don't free inpcbs until after the socket is closed, so we always need
to unlock an inpcb after calling tcp_drop() on it.

MFC after: 3 months


# 34af7bae 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Properly handle an edge case previously not handled correctly: a
socket can have a tcp connection that has entered time wait
attached to it, in the event that shutdown() is called on the
socket and the FINs properly exchange before close(). In this
case we don't detach or free the inpcb, just leave the tcptw
detached and freed, but we must release the inpcb lock (which we
didn't previously).

MFC after: 3 months


# 623dce13 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.

- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.

- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.

- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.

- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true of a inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires is to further address how to
handle the timer shutdown shutdown issue.

- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.

- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it lead to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.

- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.

These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.

MFC after: 3 months


# eaf80179 16-Feb-2006 Andre Oppermann <andre@FreeBSD.org>

Have TCP Inflight disable itself if the RTT is below a certain
threshold. Inflight doesn't make sense on a LAN as it has
trouble figuring out the maximal bandwidth because of the coarse
tick granularity.

The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold
in milliseconds below which inflight will disengage. It defaults
to 10ms.

Tested by: Joao Barros <joao.barros-at-gmail.com>,
Rich Murphey <rich-at-whiteoaklabs.com>
Sponsored by: TCP/IP Optimization Fundraise 2005


# 34333b16 02-Nov-2005 Andre Oppermann <andre@FreeBSD.org>

Retire MT_HEADER mbuf type and change its users to use MT_DATA.

Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it. It only adds a layer of confusion. The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit. For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 7691747a 12-Oct-2005 Philip Paeps <philip@FreeBSD.org>

Unbreak the net.inet6.tcp6.getcred sysctl.

This makes inetd/auth work again in IPv6 setups.

Pointy hat to: ume/KAME


# ac827533 02-Oct-2005 Maxim Konovalov <maxim@FreeBSD.org>

o Teach sysctl_drop() how to deal with the sockets in TIME_WAIT state.
This is a special case because tcp_twstart() destroys a tcp control
block via tcp_discardcb() so we cannot call tcp_drop(struct *tcpcb) on
such connections. Use tcp_twclose() instead.

MFC after: 5 days


# ffabe3dc 10-Sep-2005 Andre Oppermann <andre@FreeBSD.org>

In tcp_ctlinput() do not swap ip->ip_len a second time. It
has been done in icmp_input() already.

This fixes the ICMP_UNREACH_NEEDFRAG case where no MTU was
proposed in the ICMP reply.

PR: kern/81813
Submitted by: Vitezslav Novy <vita at fio.cz>
MFC after: 3 days


# e0aec682 30-Aug-2005 Andre Oppermann <andre@FreeBSD.org>

Use the correct mbuf type for MGET().


# 4dad226e 31-Jul-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

recover the line which was wrongly disappeared during scope cleanup.
tcpdrop(8) should work for IPv6, again.


# a1f7e5f8 24-Jul-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

scope cleanup. with this change
- most of the kernel code will not care about the actual encoding of
scope zone IDs and won't touch "s6_addr16[1]" directly.
- similarly, most of the kernel code will not care about link-local
scoped addresses as a special case.
- scope boundary check will be stricter. For example, the current
*BSD code allows a packet with src=::1 and dst=(some global IPv6
address) to be sent outside of the node, if the application do:
s = socket(AF_INET6);
bind(s, "::1");
sendto(s, some_global_IPv6_addr);
This is clearly wrong, since ::1 is only meaningful within a single
node, but the current implementation of the *BSD kernel cannot
reject this attempt.

Submitted by: JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp>
Obtained from: KAME


# f59a9ebf 18-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Remove no-op spl's and most comment references to spls, as TCP locking
is believed to be basically done (modulo any remaining bugs).

MFC after: 3 days


# 482ac968 01-Jul-2005 Paul Saab <ps@FreeBSD.org>

Fix for a bug in the change that defers sack option processing until
after PAWS checks. The symptom of this is an inconsistency in the cached
sack state, caused by the fact that the sack scoreboard was not being
updated for an ACK handled in the header prediction path.

Found by: Andrey Chernov.
Submitted by: Noritoshi Demizu, Raja Mukerji.
Approved by: re


# e3d5315d 31-May-2005 Robert Watson <rwatson@FreeBSD.org>

Assert tcbinfo lock in tcp_drop() due to its call of tcp_close()
Assert tcbinfo lock in tcp_close() due to its call to in{,6}_detach()
Assert tcbinfo lock in tcp_drop_syn_sent() due to its call to tcp_drop()

MFC after: 7 days


# fe2eee82 06-May-2005 Colin Percival <cperciva@FreeBSD.org>

Fix two issues which were missed in FreeBSD-SA-05:08.kmem.

Reported by: Uwe Doering


# 9e4ca631 04-May-2005 Andre Oppermann <andre@FreeBSD.org>

If we don't get a suggested MTU during path MTU discovery
look up the packet size of the packet that generated the
response, step down the MTU by one step through ip_next_mtu()
and try again.

Suggested by: dwmalone


# a6235da6 21-Apr-2005 Paul Saab <ps@FreeBSD.org>

- Make the sack scoreboard logic use the TAILQ macros. This improves
code readability and facilitates some anticipated optimizations in
tcp_sack_option().
- Remove tcp_print_holes() and TCP_SACK_DEBUG.

Submitted by: Raja Mukerji.
Reviewed by: Mohan Srinivasan, Noritoshi Demizu.


# 1aedbd9c 21-Apr-2005 Andre Oppermann <andre@FreeBSD.org>

Move Path MTU discovery ICMP processing from icmp_input() to
tcp_ctlinput() and subject it to active tcpcb and sequence
number checking. Previously any ICMP unreachable/needfrag
message would cause an update to the TCP hostcache. Now only
ICMP PMTU messages belonging to an active TCP session with
the correct src/dst/port and sequence number will update the
hostcache and complete the path MTU discovery process.

Note that we don't entirely implement the recommended counter
measures of Section 7.2 of the paper. However we close down
the possible degradation vector from trivially easy to really
complex and resource intensive. In addition we have limited
the smallest acceptable MTU with net.inet.tcp.minmss sysctl
for some time already, further reducing the effect of any
degradation due to an attack.

Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.2
MFC after: 3 days


# 1600372b 20-Apr-2005 Andre Oppermann <andre@FreeBSD.org>

Ignore ICMP Source Quench messages for TCP sessions. Source Quench is
ineffective, depreciated and can be abused to degrade the performance
of active TCP sessions if spoofed.

Replace a bogus call to tcp_quench() in tcp_output() with the direct
equivalent tcpcb variable assignment.

Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.1
MFC after: 3 days


# e346eeff 09-Apr-2005 Paul Saab <ps@FreeBSD.org>

- If the reassembly queue limit was reached or if we couldn't allocate
a reassembly queue state structure, don't update (receiver) sack
report.
- Similarly, if tcp_drain() is called, freeing up all items on the
reassembly queue, clean the sack report.

Found, Submitted by: Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com),
Raja Mukerji (raja at moselle dot com).


# 31199c84 28-Feb-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Use NET_CALLOUT_MPSAFE macro.


# 9945c0e2 14-Feb-2005 Maxim Konovalov <maxim@FreeBSD.org>

o Add handling of an IPv4-mapped IPv6 address.
o Use SYSCTL_IN() macro instead of direct call of copyin(9).

Submitted by: ume

o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where
most of tcp sysctls live.
o There are net.inet[6].tcp[6].getcred sysctls already, no needs in
a separate struct tcp_ident_mapping.

Suggested by: ume


# 6d0a982b 04-Feb-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

teach scope of IPv6 address to net.inet6.tcp6.getcred.

MFC after: 1 week


# 06456da2 30-Jan-2005 Robert Watson <rwatson@FreeBSD.org>

Update an additional reference to the rate of ISN tick callouts that was
missed in tcp_subr.c:1.216: projected_offset must also reflect how often
the tcp_isn_tick() callout will fire.

MFC after: 2 weeks
Submitted by: silby


# 54082796 30-Jan-2005 Robert Watson <rwatson@FreeBSD.org>

Have tcp_isn_tick() fire 100 times a second, rather than HZ times a
second; since the default hz has changed to 1000 times a second,
this resulted in unecessary work being performed.

MFC after: 2 weeks
Discussed with: phk, cperciva
General head nod: silby


# c398230b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 452d9f5b 22-Dec-2004 Robert Watson <rwatson@FreeBSD.org>

Attempt to consistently use () around return values in calls to
return() in newer code (sysctl, ISN, timewait).


# 06da46b2 22-Dec-2004 Robert Watson <rwatson@FreeBSD.org>

Remove an XXXRW comment relating to whether or not the TCP timers are
MPSAFE: they are now believed to be.

Correct a typo in a second comment.

MFC after: 2 weeks


# b9155d92 05-Dec-2004 Robert Watson <rwatson@FreeBSD.org>

Assert inpcb lock in:

tcpip_fillheaders()
tcp_discardcb()
tcp_close()
tcp_notify()
tcp_new_isn()
tcp_xmit_bandwidth_limit()

Fix a locking comment in tcp_twstart(): the pcbinfo will be locked (and
is asserted).

MFC after: 2 weeks


# cce83ffb 23-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

tcp_timewait() performs multiple non-atomic reads on the tcptw
structure, so assert the inpcb lock associated with the tcptw.
Also assert the tcbinfo lock, as tcp_timewait() may call
tcp_twclose() or tcp_2msl_rest(), which require it. Since
tcp_timewait() is already called with that lock from tcp_input(),
this doesn't change current locking, merely documents reasons for
it.

In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_rest()
is called, which requires that lock.

In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop()
is called, which requires that lock.

Document the locking strategy for the time wait queues in tcp_timer.c,
which consists of protecting the time wait queues in the same manner
as the tcbinfo structure (using the tcbinfo lock).

In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait
queues are modified.

In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait
queues may be modified.

In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait
queues may be modified.

MFC after: 2 weeks


# 7258e91f 23-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Assert the inpcb lock in tcp_twstart(), which does both read-modify-write
on the tcpcb, but also calls into tcp_close() and tcp_twrespond().

Annotate that tcp_twrecycleable() requires the inpcb lock because it does
a series of non-atomic reads of the tcpcb, but is currently called
without the inpcb lock by the caller. This is a bug.

Assert the inpcb lock in tcp_twclose() as it performs a read-modify-write
of the timewait structure/inpcb, and calls in_pcbdetach() which requires
the lock.

Assert the inpcb lock in tcp_twrespond(), as it performs multiple
non-atomic reads of the tcptw and inpcb structures, as well as calling
mac_create_mbuf_from_inpcb(), tcpip_fillheaders(), which require the
inpcb lock.

MFC after: 2 weeks


# 8263bab3 23-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Assert inpcb lock in tcp_quench(), tcp_drop_syn_sent(), tcp_mtudisc(),
and tcp_drop(), due to read-modify-write of TCP state variables.

MFC after: 2 weeks


# 8438db0f 23-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Assert the tcbinfo write lock in tcp_new_isn(), as the tcbinfo lock
protects access to the ISN state variables.

Acquire the tcbinfo write lock in tcp_isn_tick() to synchronize
timer-driven isn bumping.

Staticize internal ISN variables since they're not used outside of
tcp_subr.c.

MFC after: 2 weeks


# 3d54848f 08-Nov-2004 SUZUKI Shinsuke <suz@FreeBSD.org>

support TCP-MD5(IPv4) in KAME-IPSEC, too.

MFC after: 3 week


# c94c54e4 02-Nov-2004 Andre Oppermann <andre@FreeBSD.org>

Remove RFC1644 T/TCP support from the TCP side of the network stack.

A complete rationale and discussion is given in this message
and the resulting discussion:

http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel. Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols) and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.

Discussed on: -arch


# 81158452 18-Oct-2004 Robert Watson <rwatson@FreeBSD.org>

Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
mutex, avoiding sofree() having to drop the socket mutex and re-order,
which could lead to races permitting more than one thread to enter
sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
the protocol to the socket, preventing races in clearing and
evaluation of the reference such that sofree() might be called more
than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket. The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets. The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after: 3 days
Reviewed by: dwhite
Discussed with: gnn, dwhite, green
Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by: Vlad <marchenko at gmail dot com>


# a55db2b6 05-Oct-2004 Paul Saab <ps@FreeBSD.org>

- Estimate the amount of data in flight in sack recovery and use it
to control the packets injected while in sack recovery (for both
retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
number of sack retransmissions done upon initiation of sack recovery.

Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>


# b5d47ff5 04-Sep-2004 John-Mark Gurney <jmg@FreeBSD.org>

fix up socket/ip layer violation... don't assume/know that
SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...


# 6f2d4ea6 19-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

For IPv6 access pointer to tcpcb only after we have checked it is valid.

Found by: Coverity's automated analysis (via Ted Unangst)


# a4f757cd 16-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

White space cleanup for netinet before branch:

- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by: re (scottl)
Submitted by: Xin LI <delphij@frontfree.net>


# 84911266 12-Aug-2004 David Malone <dwmalone@FreeBSD.org>

In tcp6_ctlinput, lock tcbinfo around the call to syncache_unreach
so that the locks held are the same as the IPv4 case.

Reviewed by: rwatson


# 420a2811 11-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

Backout removal of UMA_ZONE_NOFREE flag for all zones which are established
for structures with timers in them. It might be that a timer might fire
even when the associated structure has already been free'd. Having type-
stable storage in this case is beneficial for graceful failure handling and
debugging.

Discussed with: bosko, tegge, rwatson


# 4efb805c 11-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

Remove the UMA_ZONE_NOFREE flag to all uma_zcreate() calls in the IP and
TCP code. This flag would have prevented giving back excessive free slabs
to the global pool after a transient peak usage.


# f31f65a7 05-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Pass pcbinfo structures to in6_pcbnotify() rather than pcbhead
structures, allowing in6_pcbnotify() to lock the pcbinfo and each
inpcb that it notifies of ICMPv6 events. This prevents inpcb
assertions from firing when IPv6 generates and delievers event
notifications for inpcbs.

Reported by: kuriyama
Tested by: kuriyama


# 24a098ea 03-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

o Move the inflight sysctls to their own sub-tree under net.inet.tcp to be
more consistent with the other sysctls around it.


# 56f21b9d 26-Jul-2004 Colin Percival <cperciva@FreeBSD.org>

Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with: rwatson, scottl
Requested by: jhb


# 04f0d9a0 19-Jul-2004 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

Let IN_FASTREOCOVERY macro decide if we are in recovery mode.

Nuke sackhole_limit for now. We need to add it back to limit the total
number of sack blocks in the system.


# 76947e32 23-Jun-2004 Paul Saab <ps@FreeBSD.org>

Move the sack sysctl's under net.inet.tcp.sack

net.inet.tcp.do_sack -> net.inet.tcp.sack.enable
net.inet.tcp.sackhole_limit -> net.inet.tcp.sack.sackhole_limit

Requested by: wollman


# 6d90faf3 23-Jun-2004 Paul Saab <ps@FreeBSD.org>

Add support for TCP Selective Acknowledgements. The work for this
originated on RELENG_4 and was ported to -CURRENT.

The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.

You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.

Reviewed by: gnn
Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)


# d330008e 20-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

If debug.mpsafenet is set, initialize TCP callouts as CALLOUT_MPSAFE.


# 395a08c9 12-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol
layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
this could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS


# c18b97c6 03-May-2004 Robert Watson <rwatson@FreeBSD.org>

Switch to using the inpcb MAC label instead of socket MAC label when
labeling new mbufs created from sockets/inpcbs in IPv4. This helps avoid
the need for socket layer locking in the lower level network paths
where inpcb locks are already frequently held where needed. In
particular:

- Use the inpcb for label instead of socket in raw_append().
- Use the inpcb for label instead of socket in tcp_output().
- Use the inpcb for label instead of socket in tcp_respond().
- Use the inpcb for label instead of socket in tcp_twrespond().
- Use the inpcb for label instead of socket in syncache_respond().

While here, modify tcp_respond() to avoid assigning NULL to a stack
variable and centralize assertions about the inpcb when inp is
assigned.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research


# c1537ef0 20-Apr-2004 Mike Silbersack <silby@FreeBSD.org>

Enhance our RFC1948 implementation to perform better in some pathlogical
TIME_WAIT recycling cases I was able to generate with http testing tools.

In short, as the old algorithm relied on ticks to create the time offset
component of an ISN, two connections with the exact same host, port pair
that were generated between timer ticks would have the exact same sequence
number. As a result, the second connection would fail to pass the TIME_WAIT
check on the server side, and the SYN would never be acknowledged.

I've "fixed" this by adding random positive increments to the time component
between clock ticks so that ISNs will *always* be increasing, no matter how
quickly the port is recycled.

Except in such contrived benchmarking situations, this problem should never
come up in normal usage... until networks get faster.

No MFC planned, 4.x is missing other optimizations that are needed to even
create the situation in which such quick port recycling will occur.


# f36cfd49 07-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 47f32f6f 04-Apr-2004 Robert Watson <rwatson@FreeBSD.org>

Two missed in previous commit -- compare pointer with NULL rather than
using it as a boolean.


# 24459934 04-Apr-2004 Robert Watson <rwatson@FreeBSD.org>

Prefer NULL to 0 when checking pointer values as integers or booleans.


# a7b6a14a 28-Feb-2004 Robert Watson <rwatson@FreeBSD.org>

Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These
were needed by the MAC Framework until inpcbs gained labels.

Submitted by: sam


# 47934cef 25-Feb-2004 Don Lewis <truckman@FreeBSD.org>

Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms


# 12e2e970 24-Feb-2004 Andre Oppermann <andre@FreeBSD.org>

Convert the tcp segment reassembly queue to UMA and limit the maximum
amount of segments it will hold.

The following tuneables and sysctls control the behaviour of the tcp
segment reassembly queue:

net.inet.tcp.reass.maxsegments (loader tuneable)
specifies the maximum number of segments all tcp reassemly queues can
hold (defaults to 1/16 of nmbclusters).

net.inet.tcp.reass.maxqlen
specifies the maximum number of segments any individual tcp session queue
can hold (defaults to 48).

net.inet.tcp.reass.cursegments (readonly)
counts the number of segments currently in all reassembly queues.

net.inet.tcp.reass.overflows (readonly)
counts how often either the global or local queue limit has been reached.

Tested by: bms, silby
Reviewed by: bms, silby


# 41fe0c8a 19-Feb-2004 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fixed ucred structure leak.

Approved by: scottl (mentor)
PR: 54163
MFC after: 3 days


# 32ff0466 14-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Final brucification pass. Spell types consistently (u_int). Remove bogus
casts. Remove unnecessary parenthesis.

Submitted by: bde


# 265ed012 13-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Brucification.

Submitted by: bde


# efddf5c6 13-Feb-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

supported IPV6_RECVPATHMTU socket option.

Obtained from: KAME


# b30190b5 12-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Update the prototype for tcpsignature_apply() to reflect the spelling of
the types used by m_apply()'s callback function, f, as documented in mbuf(9).

Noticed by: njl


# bca0e5bf 12-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

style(9) pass; whitespace and comments.

Submitted by: njl


# 1cfd4b53 10-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Initial import of RFC 2385 (TCP-MD5) digest support.

This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by: sentex.net


# 53369ac9 08-Jan-2004 Andre Oppermann <andre@FreeBSD.org>

Limiters and sanity checks for TCP MSS (maximum segement size)
resource exhaustion attacks.

For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU. This is done
dynamically based on feedback from the remote host and network
components along the packet path. This information can be
abused to pretend an extremely low path MTU.

The resource exhaustion works in two ways:

o during tcp connection setup the advertized local MSS is
exchanged between the endpoints. The remote endpoint can
set this arbitrarily low (except for a minimum MTU of 64
octets enforced in the BSD code). When the local host is
sending data it is forced to send many small IP packets
instead of a large one.

For example instead of the normal TCP payload size of 1448
it forces TCP payload size of 12 (MTU 64) and thus we have
a 120 times increase in workload and packets. On fast links
this quickly saturates the local CPU and may also hit pps
processing limites of network components along the path.

This type of attack is particularly effective for servers
where the attacker can download large files (WWW and FTP).

We mitigate it by enforcing a minimum MTU settable by sysctl
net.inet.tcp.minmss defaulting to 256 octets.

o the local host is reveiving data on a TCP connection from
the remote host. The local host has no control over the
packet size the remote host is sending. The remote host
may chose to do what is described in the first attack and
send the data in packets with an TCP payload of at least
one byte. For each packet the tcp_input() function will
be entered, the packet is processed and a sowakeup() is
signalled to the connected process.

For example an attack with 2 Mbit/s gives 4716 packets per
second and the same amount of sowakeup()s to the process
(and context switches).

This type of attack is particularly effective for servers
where the attacker can upload large amounts of data.
Normally this is the case with WWW server where large POSTs
can be made.

We mitigate this by calculating the average MSS payload per
second. If it goes below 'net.inet.tcp.minmss' and the pps
rate is above 'net.inet.tcp.minmssoverload' defaulting to
1000 this particular TCP connection is resetted and dropped.

MITRE CVE: CAN-2004-0002
Reviewed by: sam (mentor)
MFC after: 1 day


# bf87c82e 08-Jan-2004 Andre Oppermann <andre@FreeBSD.org>

If path mtu discovery is enabled set the DF bit in all cases we
send packets on a tcp connection.

PR: kern/60889
Tested by: Richard Wendland <richard@wendland.org.uk>
Approved by: re (scottl)


# dba7bc6a 06-Jan-2004 Andre Oppermann <andre@FreeBSD.org>

Enable the following TCP options by default to give it more exposure:

rfc3042 Limited retransmit
rfc3390 Increasing TCP's initial congestion Window
inflight TCP inflight bandwidth limiting

All my production server have it enabled and there have been no
issues. I am confident about having them on by default and it gives
us better overall TCP performance.

Reviewed by: sam (mentor)


# a5b061f9 17-Dec-2003 John Baldwin <jhb@FreeBSD.org>

Fix some becuase -> because typos.

Reported by: Marco Wertejuk <wertejuk@mwcis.com>


# 2d92ec98 17-Dec-2003 Robert Watson <rwatson@FreeBSD.org>

Switch TCP over to using the inpcb label when responding in timed
wait, rather than the socket label. This avoids reaching up to
the socket layer during connection close, which requires locking
changes. To do this, introduce MAC Framework entry point
mac_create_mbuf_from_inpcb(), which is called from tcp_twrespond()
instead of calling mac_create_mbuf_from_socket() or
mac_create_mbuf_netlayer(). Introduce MAC Policy entry point
mpo_create_mbuf_from_inpcb(), and implementations for various
policies, which generally just copy label data from the inpcb to
the mbuf. Assert the inpcb lock in the entry point since we
require consistency for the inpcb label reference.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 0cfbbe3b 26-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Make sure all uses of stack allocated struct route's are properly
zeroed. Doing a bzero on the entire struct route is not more
expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst.

Reviewed by: sam (mentor)
Approved by: re (scottl)


# 97d8d152 20-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# c29afad6 08-Nov-2003 Sam Leffler <sam@FreeBSD.org>

o correct locking problem: the inpcb must be held across tcp_respond
o add assertions in tcp_respond to validate inpcb locking assumptions
o use local variable instead of chasing pointers in tcp_respond

Supported by: FreeBSD Foundation


# 4bd4fa3f 02-Nov-2003 Mike Silbersack <silby@FreeBSD.org>

Add an additional check to the tcp_twrecycleable function; I had
previously only considered the send sequence space. Unfortunately,
some OSes (windows) still use a random positive increments scheme for
their syn-ack ISNs, so I must consider receive sequence space as well.

The value of 250000 bytes / second for Microsoft's ISN rate of increase
was determined by testing with an XP machine.


# 96af9ea5 01-Nov-2003 Mike Silbersack <silby@FreeBSD.org>

- Add a new function tcp_twrecycleable, which tells us if the ISN which
we will generate for a given ip/port tuple has advanced far enough
for the time_wait socket in question to be safely recycled.

- Have in_pcblookup_local use tcp_twrecycleable to determine if
time_Wait sockets which are hogging local ports can be safely
freed.

This change preserves proper TIME_WAIT behavior under normal
circumstances while allowing for safe and fast recycling whenever
ephemeral port space is scarce.


# 0709c233 23-Oct-2003 Mike Silbersack <silby@FreeBSD.org>

Reduce the number of tcp time_wait structs to maxsockets / 5; this ensures
that at most 20% of sockets can be in time_wait at one time, ensuring
that time_wait sockets do not starve real connections from inpcb
structures.

No implementation change is needed, jlemon already implemented a nice
LRU-ish algorithm for tcp_tw structure recycling.

This should reduce the need for sysadmins to lower the default msl on
busy servers.


# 184dcdc7 21-Oct-2003 Mike Silbersack <silby@FreeBSD.org>

Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.


# 78f94aa9 11-Sep-2003 Ruslan Ermilov <ru@FreeBSD.org>

Fix a bunch of off-by-one errors in the range checking code.


# baee0c3e 21-Aug-2003 Robert Watson <rwatson@FreeBSD.org>

Introduce two new MAC Framework and MAC policy entry points:

mac_reflect_mbuf_icmp()
mac_reflect_mbuf_tcp()

These entry points permit MAC policies to do "update in place"
changes to the labels on ICMP and TCP mbuf headers when an ICMP or
TCP response is generated to a packet outside of the context of
an existing socket. For example, in respond to a ping or a RST
packet to a SYN on a closed port.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 430c6354 06-May-2003 Robert Watson <rwatson@FreeBSD.org>

Correct a bug introduced with reduced TCP state handling; make
sure that the MAC label on TCP responses during TIMEWAIT is
properly set from either the socket (if available), or the mbuf
that it's responding to.

Unfortunately, this is made somewhat difficult by the TCP code,
as tcp_twstart() calls tcp_twrespond() after discarding the socket
but without a reference to the mbuf that causes the "response".
Passing both the socket and the mbuf works arounds this--eventually
it might be good to make sure the mbuf always gets passed in in
"response" scenarios but working through this provided to
complicate things too much.

Approved by: re (scottl)
Reviewed by: hsu
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# cacd79e2 10-Apr-2003 Robert Watson <rwatson@FreeBSD.org>

Remove a potential panic condition introduced by reduced TCP wait
state. Those changed attempted to work around the changed invariant
that inp->in_socket was sometimes now NULL, but the logic wasn't
quite right, meaning that inp->in_socket would be dereferenced by
cr_canseesocket() if security.bsd.see_other_uids, jail, or MAC
were in use. Attempt to clarify and correct the logic.

Note: the work-around originally introduced with the reduced TCP
wait state handling to use cr_cansee() instead of cr_canseesocket()
in this case isn't really right, although it "Does the right thing"
for most of the cases in the base system. We'll need to address
this at some point in the future.

Pointed out by: dcs
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 607b0b0c 08-Mar-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Remove a panic(); if the zone allocator can't provide more timewait
structures, reuse the oldest one. Also move the expiry timer from
a per-structure callout to the tcp slow timer.

Sponsored by: DARPA, NAI Labs


# 521f364b 02-Mar-2003 Dag-Erling Smørgrav <des@FreeBSD.org>

More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).


# 9327ee33 25-Feb-2003 Robert Watson <rwatson@FreeBSD.org>

When generating a TCP response to a connection, not only test if the
tcpcb is NULL, but also its connected inpcb, since we now allow
elements of a TCP connection to hang around after other state, such
as the socket, has been recycled.

Tested by: dcs
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# d25ecb91 21-Feb-2003 Poul-Henning Kamp <phk@FreeBSD.org>

- m = m_gethdr(M_NOWAIT, MT_HEADER);
+ m = m_gethdr(M_DONTWAIT, MT_HEADER);

'nuff said.


# ffae8c5a 19-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Unbreak non-IPV6 compilation.

Caught by: phk
Sponsored by: DARPA, NAI Labs


# 340c35de 19-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Add a TCP TIMEWAIT state which uses less space than a fullblown TCP
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.

Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs


# 79909384 19-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so the
routine does not require a tcpcb to operate. Since we no longer keep
template mbufs around, move pseudo checksum out of this routine, and
merge it with the length update.

Sponsored by: DARPA, NAI Labs


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 4b40c56c 14-Feb-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Take advantage of pre-existing lock-free synchronization and type stable memory
to avoid acquiring SMP locks during expensive copyout process.


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# abe239cf 24-Dec-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Validate inp to prevent an use after free.


# d7ff8ef6 14-Dec-2002 Matthew Dillon <dillon@FreeBSD.org>

Change tcp.inflight_min from 1024 to a production default of 6144. Create
a sysctl for the stabilization value for the bandwidth delay product (inflight)
algorithm and document it.

MFC after: 3 days


# 53be11f6 20-Oct-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Fix two instances of variant struct definitions in sys/netinet:

Remove the never completed _IP_VHL version, it has not caught on
anywhere and it would make us incompatible with other BSD netstacks
to retain this version.

Add a CTASSERT protecting sizeof(struct ip) == 20.

Don't let the size of struct ipq depend on the IPDIVERT option.

This is a functional no-op commit.

Approved by: re


# b9234faf 15-Oct-2002 Sam Leffler <sam@FreeBSD.org>

Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks


# 5d846453 15-Oct-2002 Sam Leffler <sam@FreeBSD.org>

Replace aux mbufs with packet tags:

o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month


# c8d50f24 10-Oct-2002 Matthew Dillon <dillon@FreeBSD.org>

turn off debugging by default if bandwidth delay product limiting is
turned on (it is already off in -stable).


# 4f1e1f32 24-Aug-2002 Matthew Dillon <dillon@FreeBSD.org>

Correct bug in t_bw_rtttime rollover, #undef USERTT


# 1fcc99b5 17-Aug-2002 Matthew Dillon <dillon@FreeBSD.org>

Implement TCP bandwidth delay product window limiting, similar to (but
not meant to duplicate) TCP/Vegas. Add four sysctls and default the
implementation to 'off'.

net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off)
net.inet.tcp.inflight_debug debugging (defaults to 1=on)
net.inet.tcp.inflight_min minimum window limit
net.inet.tcp.inflight_max maximum window limit

MFC after: 1 week


# d00e44fb 31-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Document the undocumented assumption that at least one of the PCB
pointer and incoming mbuf pointer will be non-NULL in tcp_respond().
This is relied on by the MAC code for correctness, as well as
existing code.

Obtained from: TrustedBSD PRoject
Sponsored by: DARPA, NAI Labs


# c488362e 31-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument the TCP socket code for packet generation and delivery:
label outgoing mbufs with the label of the socket, and check socket and
mbuf labels before permitting delivery to a socket. Assign labels
to newly accepted connections when the syncache/cookie code has done
its business. Also set peer labels as convenient. Currently,
MAC policies cannot influence the PCB matching algorithm, so cannot
implement polyinstantiation. Note that there is at least one case
where a PCB is not available due to the TCP packet not being associated
with any socket, so we don't label in that case, but need to handle
it in a special manner.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 5c38b6db 28-Jul-2002 Don Lewis <truckman@FreeBSD.org>

Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.


# 701bec5a 18-Jul-2002 Matthew Dillon <dillon@FreeBSD.org>

Introduce two new sysctl's:

net.inet.tcp.rexmit_min (default 3 ticks equiv)

This sysctl is the retransmit timer RTO minimum,
specified in milliseconds. This value is
designed for algorithmic stability only.

net.inet.tcp.rexmit_slop (default 200ms)

This sysctl is the retransmit timer RTO slop
which is added to every retransmit timeout and
is designed to handle protocol stack overheads
and delayed ack issues.

Note that the *original* code applied a 1-second
RTO minimum but never applied real slop to the RTO
calculation, so any RTO calculation over one second
would have no slop and thus not account for
protocol stack overheads (TCP timestamps are not
a measure of protocol turnaround!). Essentially,
the original code made the RTO calculation almost
completely irrelevant.

Please note that the 200ms slop is debateable.
This commit is not meant to be a line in the sand,
and if the community winds up deciding that increasing
it is the correct solution then it's easy to do.
Note that larger values will destroy performance
on lossy networks while smaller values may result in
a greater number of unnecessary retransmits.


# 0e1eebb8 11-Jul-2002 Don Lewis <truckman@FreeBSD.org>

Defer calling SYSCTL_OUT() until after the locks have been released.


# 142b2bd6 11-Jul-2002 Don Lewis <truckman@FreeBSD.org>

Reduce the nesting level of a code block that doesn't need to be in
an else clause.


# eb538bfd 30-Jun-2002 Jesper Skriver <jesper@FreeBSD.org>

Extend the effect of the sysctl net.inet.tcp.icmp_may_rst
so that, if we recieve a ICMP "time to live exceeded in transit",
(type 11, code 0) for a TCP connection on SYN-SENT state, close
the connection.

MFC after: 2 weeks


# 2d40081d 21-Jun-2002 Jeffrey Hsu <hsu@FreeBSD.org>

TCP notify functions can change the pcb list.


# 3ce144ea 14-Jun-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Notify functions can destroy the pcb, so they have to return an
indication of whether this happenned so the calling function
knows whether or not to unlock the pcb.

Submitted by: Jennifer Yang (yangjihui@yahoo.com)
Bug reported by: Sid Carter (sidcarter@symonds.net)


# 73dca207 11-Jun-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Fix logic which resulted in missing a call to INP_UNLOCK().


# f76fcf6d 10-Jun-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Lock up inpcb.

Submitted by: Jennifer Yang <yangjihui@yahoo.com>


# 4cc20ab1 31-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 243917fe 19-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# 898568d8 10-Apr-2002 Mike Silbersack <silby@FreeBSD.org>

Remove some ISN generation code which has been unused since the
syncache went in.

MFC after: 3 days


# 44731cab 01-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 29dc1288 22-Mar-2002 Robert Watson <rwatson@FreeBSD.org>

Merge from TrustedBSD MAC branch:

Move the network code from using cr_cansee() to check whether a
socket is visible to a requesting credential to using a new
function, cr_canseesocket(), which accepts a subject credential
and object socket. Implement cr_canseesocket() so that it does a
prison check, a uid check, and add a comment where shortly a MAC
hook will go. This will allow MAC policies to seperately
instrument the visibility of sockets from the visibility of
processes.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 69c2d429 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Switch vm_zone.h with uma.h. Change over to uma interfaces.


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 60d9b32e 26-Feb-2002 Alfred Perlstein <alfred@FreeBSD.org>

More IPV6 const fixes.


# 76183f34 26-Feb-2002 Dima Dorfman <dd@FreeBSD.org>

Introduce a version field to `struct xucred' in place of one of the
spares (the size of the field was changed from u_short to u_int to
reflect what it really ends up being). Accordingly, change users of
xucred to set and check this field as appropriate. In the kernel,
this is being done inside the new cru2x() routine which takes a
`struct ucred' and fills out a `struct xucred' according to the
former. This also has the pleasant sideaffect of removing some
duplicate code.

Reviewed by: rwatson


# c5993889 04-Feb-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

In tcp_respond(), correctly reset returned IPv6 header. This is essential
when the original packet contains an IPv6 extension header.

Obtained from: KAME
MFC after: 1 week


# be2ac88c 21-Nov-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce a syncache, which enables FreeBSD to withstand a SYN flood
DoS in an improved fashion over the existing code.

Reviewed by: silby (in a previous iteration)
Sponsored by: DARPA, NAI Labs


# ce178806 07-Nov-2001 Robert Watson <rwatson@FreeBSD.org>

o Replace reference to 'struct proc' with 'struct thread' in 'struct
sysctl_req', which describes in-progress sysctl requests. This permits
sysctl handlers to have access to the current thread, permitting work
on implementing td->td_ucred, migration of suser() to using struct
thread to derive the appropriate ucred, and allowing struct thread to be
passed down to other code, such as network code where td is not currently
available (and curproc is used).

o Note: netncp and netsmb are not updated to reflect this change, as they
are not currently KSE-adapted.

Reviewed by: julian
Obtained from: TrustedBSD Project


# 8a7d8cc6 09-Oct-2001 Robert Watson <rwatson@FreeBSD.org>

- Combine kern.ps_showallprocs and kern.ipc.showallsockets into
a single kern.security.seeotheruids_permitted, describes as:
"Unprivileged processes may see subjects/objects with different real uid"
NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
an API change. kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
domain socket credential instead of comparing root vnodes for the
UDS and the process. This allows multiple jails to share the same
chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().

Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions. This also better-supports
the introduction of additional MAC models.

Reviewed by: ps, billf
Obtained from: TrustedBSD Project


# 4787fd37 05-Oct-2001 Paul Saab <ps@FreeBSD.org>

Only allow users to see their own socket connections if
kern.ipc.showallsockets is set to 0.

Submitted by: billf (with modifications by me)
Inspired by: Dave McKay (aka pm aka Packet Magnet)
Reviewed by: peter
MFC after: 2 weeks


# 94088977 20-Sep-2001 Robert Watson <rwatson@FreeBSD.org>

o Rename u_cansee() to cr_cansee(), making the name more comprehensible
in the face of a rename of ucred to cred, and possibly generally.

Obtained from: TrustedBSD Project


# b0e3ad75 21-Aug-2001 Mike Silbersack <silby@FreeBSD.org>

Much delayed but now present: RFC 1948 style sequence numbers

In order to ensure security and functionality, RFC 1948 style
initial sequence number generation has been implemented. Barring
any major crypographic breakthroughs, this algorithm should be
unbreakable. In addition, the problems with TIME_WAIT recycling
which affect our currently used algorithm are not present.

Reviewed by: jesper


# 57e119f6 26-Jul-2001 Peter Wemm <peter@FreeBSD.org>

Fix a warning.


# 01651724 26-Jul-2001 Peter Wemm <peter@FreeBSD.org>

Patch up some style(9) stuff in tcp_new_isn()


# 92971bd3 26-Jul-2001 Peter Wemm <peter@FreeBSD.org>

s/OpemBSD/OpenBSD/


# 2d610a50 07-Jul-2001 Mike Silbersack <silby@FreeBSD.org>

Temporary feature: Runtime tuneable tcp initial sequence number
generation scheme. Users may now select between the currently used
OpenBSD algorithm and the older random positive increment method.

While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT
handling; this is causing trouble for an increasing number of folks.

To switch between generation schemes, one sets the sysctl
net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments,
1 = the OpenBSD algorithm. 1 is still the default.

Once a secure _and_ compatible algorithm is implemented, this sysctl
will be removed.

Reviewed by: jlemon
Tested by: numerous subscribers of -net


# 7ce87f12 23-Jun-2001 David Malone <dwmalone@FreeBSD.org>

Allow getcred sysctl to work in jailed root processes. Processes can
only do getcred calls for sockets which were created in the same jail.
This should allow the ident to work in a reasonable way within jails.

PR: 28107
Approved by: des, rwatson


# f962cba5 23-Jun-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Replace bzero() of struct ip with explicit zeroing of structure members,
which is faster.


# 08517d53 22-Jun-2001 Mike Silbersack <silby@FreeBSD.org>

Eliminate the allocation of a tcp template structure for each
connection. The information contained in a tcptemp can be
reconstructed from a tcpcb when needed.

Previously, tcp templates required the allocation of one
mbuf per connection. On large systems, this change should
free up a large number of mbufs.

Reviewed by: bmilekic, jlemon, ru
MFC after: 2 weeks


# ff242829 19-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

made sure to use the correct sa_len for rtalloc().
sizeof(ro_dst) is not necessarily the correct one.
this change would also fix the recent path MTU discovery problem for the
destination of an incoming TCP connection.

Submitted by: JINMEI Tatuya <jinmei@kame.net>
Obtained from: KAME
MFC after: 2 weeks


# 33841545 10-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# 09786698 07-Jun-2001 Peter Wemm <peter@FreeBSD.org>

"Fix" the previous initial attempt at fixing TUNABLE_INT(). This time
around, use a common function for looking up and extracting the tunables
from the kernel environment. This saves duplicating the same function
over and over again. This way typically has an overhead of 8 bytes + the
path string, versus about 26 bytes + the path string.


# 4422746f 06-Jun-2001 Peter Wemm <peter@FreeBSD.org>

Back out part of my previous commit. This was a last minute change
and I botched testing. This is a perfect example of how NOT to do
this sort of thing. :-(


# 81930014 06-Jun-2001 Peter Wemm <peter@FreeBSD.org>

Make the TUNABLE_*() macros look and behave more consistantly like the
SYSCTL_*() macros. TUNABLE_INT_DECL() was an odd name because it didn't
actually declare the int, which is what the name suggests it would do.


# d1745f45 20-Apr-2001 Jesper Skriver <jesper@FreeBSD.org>

Say goodbye to TCP_COMPAT_42

Reviewed by: wollman
Requested by: wollman


# f0a04f3f 17-Apr-2001 Kris Kennaway <kris@FreeBSD.org>

Randomize the TCP initial sequence numbers more thoroughly.

Obtained from: OpenBSD
Reviewed by: jesper, peter, -developers


# b77d155d 28-Mar-2001 Jesper Skriver <jesper@FreeBSD.org>

MFC candidate.

Change code from PRC_UNREACH_ADMIN_PROHIB to PRC_UNREACH_PORT for
ICMP_UNREACH_PROTOCOL and ICMP_UNREACH_PORT

And let TCP treat PRC_UNREACH_PORT like PRC_UNREACH_ADMIN_PROHIB

This should fix the case where port unreachables for udp returned
ENETRESET instead of ECONNREFUSED

Problem found by: Bill Fenner <fenner@research.att.com>
Reviewed by: jlemon


# 462b86fe 16-Mar-2001 Poul-Henning Kamp <phk@FreeBSD.org>

<sys/queue.h> makeover.


# c693a045 26-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Remove in_pcbnotify and use in_pcblookup_hash to find the cb directly.

For TCP, verify that the sequence number in the ICMP packet falls within
the tcp receive window before performing any actions indicated by the
icmp packet.

Clean up some layering violations (access to tcp internals from in_pcb)


# c484d1a3 23-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

When converting soft error into a hard error, drop the connection. The
error will be passed up to the user, who will close the connection, so
it does not appear to make a sense to leave the connection open.

This also fixes a bug with kqueue, where the filter does not set EOF
on the connection, because the connection is still open.

Also remove calls to so{rw}wakeup, as we aren't doing anything with
them at the moment anyway.

Reviewed by: alfred, jesper


# e4bb5b05 23-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Allow ICMP unreachables which map into PRC_UNREACH_ADMIN_PROHIB to
reset TCP connections which are in the SYN_SENT state, if the sequence
number in the echoed ICMP reply is correct. This behavior can be
controlled by the sysctl net.inet.tcp.icmp_may_rst.

Currently, only subtypes 2,3,10,11,12 are treated as such
(port, protocol and administrative unreachables).

Assocaiate an error code with these resets which is reported to the
user application: ENETRESET.

Disallow resetting TCP sessions which are not in a SYN_SENT state.

Reviewed by: jesper, -net


# d1c54148 22-Feb-2001 Jesper Skriver <jesper@FreeBSD.org>

Redo the security update done in rev 1.54 of src/sys/netinet/tcp_subr.c
and 1.84 of src/sys/netinet/udp_usrreq.c

The changes broken down:

- remove 0 as a wildcard for addresses and port numbers in
src/sys/netinet/in_pcb.c:in_pcbnotify()
- add src/sys/netinet/in_pcb.c:in_pcbnotifyall() used to notify
all sessions with the specific remote address.
- change
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
to use in_pcbnotifyall() to notify multiple sessions, instead of
using in_pcbnotify() with 0 as src address and as port numbers.
- remove check for src port == 0 in
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
as they are no longer needed.
- move handling of redirects and host dead from in_pcbnotify() to
udp_ctlinput() and tcp_ctlinput(), so they will call
in_pcbnotifyall() to notify all sessions with the specific
remote address.

Approved by: jlemon
Inspired by: NetBSD


# 58e9b417 20-Feb-2001 Jesper Skriver <jesper@FreeBSD.org>

Only call in_pcbnotify if the src port number != 0, as we
treat 0 as a wildcard in src/sys/in_pbc.c:in_pcbnotify()

It's sufficient to check for src|local port, as we'll have no
sessions with src|local port == 0

Without this a attacker sending ICMP messages, where the attached
IP header (+ 8 bytes) has the address and port numbers == 0, would
have the ICMP message applied to all sessions.

PR: kern/25195
Submitted by: originally by jesper, reimplimented by jlemon's advice
Reviewed by: jlemon
Approved by: jlemon


# c0511d3b 18-Feb-2001 Brian Feldman <green@FreeBSD.org>

Switch to using a struct xucred instead of a struct xucred when not
actually in the kernel. This structure is a different size than
what is currently in -CURRENT, but should hopefully be the last time
any application breakage is caused there. As soon as any major
inconveniences are removed, the definition of the in-kernel struct
ucred should be conditionalized upon defined(_KERNEL).

This also changes struct export_args to remove dependency on the
constantly-changing struct ucred, as well as limiting the bounds
of the size fields to the correct size. This means: a) mountd and
friends won't break all the time, b) mountd and friends won't crash
the kernel all the time if they don't know what they're doing wrt
actual struct export_args layout.

Reviewed by: bde


# 90fcbbd6 18-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unneeded loop increment in src/sys/netinet/in_pcb.c:in_pcbnotify

Add new PRC_UNREACH_ADMIN_PROHIB in sys/sys/protosw.h

Remove condition on TCP in src/sys/netinet/ip_icmp.c:icmp_input

In src/sys/netinet/ip_icmp.c:icmp_input set code = PRC_UNREACH_ADMIN_PROHIB
or PRC_UNREACH_HOST for all unreachables except ICMP_UNREACH_NEEDFRAG

Rename sysctl icmp_admin_prohib_like_rst to icmp_unreach_like_rst
to reflect the fact that we also react on ICMP unreachables that
are not administrative prohibited. Also update the comments to
reflect this.

In sys/netinet/tcp_subr.c:tcp_ctlinput add code to treat
PRC_UNREACH_ADMIN_PROHIB and PRC_UNREACH_HOST different.

PR: 23986
Submitted by: Jesper Skriver <jesper@skriver.dk>


# fc2ffbe6 04-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# 442fad67 24-Dec-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Update the "icmp_admin_prohib_like_rst" code to check the tcp-window and
to be configurable with respect to acting only in SYN or in all TCP states.

PR: 23665
Submitted by: Jesper Skriver <jesper@skriver.dk>


# b11d7a4a 16-Dec-2000 Poul-Henning Kamp <phk@FreeBSD.org>

We currently does not react to ICMP administratively prohibited
messages send by routers when they deny our traffic, this causes
a timeout when trying to connect to TCP ports/services on a remote
host, which is blocked by routers or firewalls.

rfc1122 (Requirements for Internet Hosts) section 3.2.2.1 actually
requi re that we treat such a message for a TCP session, that we
treat it like if we had recieved a RST.

quote begin.

A Destination Unreachable message that is received MUST be
reported to the transport layer. The transport layer SHOULD
use the information appropriately; for example, see Sections
4.1.3.3, 4.2.3.9, and 4.2.4 below. A transport protocol
that has its own mechanism for notifying the sender that a
port is unreachable (e.g., TCP, which sends RST segments)
MUST nevertheless accept an ICMP Port Unreachable for the
same purpose.

quote end.

I've written a small extension that implement this, it also create
a sysctl "net.inet.tcp.icmp_admin_prohib_like_rst" to control if
this new behaviour is activated.

When it's activated (set to 1) we'll treat a ICMP administratively
prohibited message (icmp type 3 code 9, 10 and 13) for a TCP
sessions, as if we recived a TCP RST, but only if the TCP session
is in SYN_SENT state.

The reason for only reacting when in SYN_SENT state, is that this
will solve the problem, and at the same time minimize the risk of
this being abused.

I suggest that we enable this new behaviour by default, but it
would be a change of current behaviour, so if people prefer to
leave it disabled by default, at least for now, this would be ok
for me, the attached diff actually have the sysctl set to 0 by
default.

PR: 23086
Submitted by: Jesper Skriver <jesper@skriver.dk>


# e82ac18e 24-Nov-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Revert the last commit to the callout interface, and add a flag to
callout_init() indicating whether the callout is safe or not. Update
the callers of callout_init() to reflect the new interface.

Okayed by: Jake


# 46aa3347 27-Oct-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Convert all users of fldoff() to offsetof(). fldoff() is bad
because it only takes a struct tag which makes it impossible to
use unions, typedefs etc.

Define __offsetof() in <machine/ansi.h>

Define offsetof() in terms of __offsetof() in <stddef.h> and <sys/types.h>

Remove myriad of local offsetof() definitions.

Remove includes of <stddef.h> in kernel code.

NB: Kernelcode should *never* include from /usr/include !

Make <sys/queue.h> include <machine/ansi.h> to avoid polluting the API.

Deprecate <struct.h> with a warning. The warning turns into an error on
01-12-2000 and the file gets removed entirely on 01-01-2001.

Paritials reviews by: various.
Significant brucifications by: bde


# d31944e6 23-Oct-2000 Jun-ichiro itojun Hagino <itojun@FreeBSD.org>

be careful on mbuf overrun on ctlinput.
short icmp6 packet may be able to panic the kernel.
sync with kame.


# be515d91 28-Sep-2000 Kris Kennaway <kris@FreeBSD.org>

Use stronger random number generation for TCP_ISSINCR and tcp_iss.

Reviewed by: peter, jlemon


# 9d8c8a67 25-Sep-2000 Bosko Milekic <bmilekic@FreeBSD.org>

Finally make do_tcpdrain sysctl live under correct parent, _net_inet_tcp,
as opposed to _debug. Like before, default value remains 1.


# e7f32693 21-Jul-2000 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

When a connection is being dropped due to a listen queue overflow,
delete the cloned route that is associated with the connection.
This does not exhaust the routing table memory when the system
is under a SYN flood attack. The route entry is not deleted if there
is any prior information cached in it.

Reviewed by: Peter Wemm,asmodai


# 686cdd19 04-Jul-2000 Jun-ichiro itojun Hagino <itojun@FreeBSD.org>

sync with kame tree as of july00. tons of bug fixes/improvements.

API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
(also syntax change)


# 77978ab8 04-Jul-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.

Pointed out by: bde


# 82d9ae4e 03-Jul-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:

Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our
sources:

-sysctl_vm_zone SYSCTL_HANDLER_ARGS
+sysctl_vm_zone (SYSCTL_HANDLER_ARGS)


# 34e3e010 19-Apr-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Let initialize th_sum before in6_cksum(), again.
Without this fix, all IPv6 TCP RST packet has wrong cksum value,
so IPv6 connect() trial to 5.0 machine won't fail until tcp connect timeout,
when they should fail soon.

Thanks to haro@tk.kubota.co.jp (Munehiro Matsuda) for his much debugging
help and detailed info.


# db4f9cc7 27-Mar-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Add support for offloading IP/TCP/UDP checksums to NIC hardware which
supports them.


# f885f636 28-Feb-2000 Paul Saab <ps@FreeBSD.org>

Limit the maximum permissible TCP window size to 65535 octets if
window scaling is disabled.

PR: kern/16914
Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>
Reviewed by: wollman
Approved by: jkh


# e2de10ab 24-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Fix the bug that IPv4 ttl is not initialized when AF_INET6 socket is used
for IPv4 communication.(IPv4 mapped IPv6 addr.)
Also removed IPv6 hoplimit initialization because it is alway done at
tcp_output.

Confirmed by: Bernd Walter <ticso@cicely5.cicely.de>


# 3a2a9f79 15-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Fixed the problem that IPsec connection hangs when bigger data is sent.
-opt_ipsec.h was missing on some tcp files (sorry for basic mistake)
-made buildable as above fix
-also added some missing IPv4 mapped IPv6 addr consideration into
ipsec4_getpolicybysock


# 0567ae02 15-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Added missing 'else' for 'if (isipv6)' at IPv6 length setting in tcp_respond().
By this bug, IPv6 reset was not sent.
(I checked around same kind of bug, but no other found.)


# 595ad281 14-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Removed wrong(unnecessary) & operators for pointer, in ipsec_hdrsiz_tcp().
This must be one of the reason why connections over IPsec hangs for
bigger packets.(which was reported on freebsd-current@freebsd.org)

But there still seems to be another bug and the problem is not yet fixed.


# 21ab895f 13-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Clear rt after RTFREE. This might have sometime caused kernel panic at rtfree()
on INET6 enabled environment.


# 72e17428 12-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

removed incorrect ip6 length setting for IPv6 tcp reset packet.


# fb59c426 09-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

tcp updates to support IPv6.
also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 052a6aad 28-Dec-1999 Mike Smith <msmith@FreeBSD.org>

Make tcp_drain() actually do something. When invoked (usually as a
desperation measure in low-memory situations), walk the tcpbs and
flush the reassembly queues.

This behaviour is currently controlled by the debug.do_tcpdrain sysctl
(defaults to on).

Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: wollman


# cfa1ca9d 07-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

udp IPv6 support, IPv6/IPv4 tunneling support in kernel,
packet divert at kernel for IPv6/IPv4 translater daemon

This includes queue related patch submitted by jburkhol@home.com.

Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 82cd038d 21-Nov-1999 Yoshinobu Inoue <shin@FreeBSD.org>

KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 76429de4 05-Nov-1999 Yoshinobu Inoue <shin@FreeBSD.org>

KAME related header files additions and merges.
(only those which don't affect c source files so much)

Reviewed by: cvs-committers
Obtained from: KAME project


# 2f9a2132 18-Sep-1999 Brian Feldman <green@FreeBSD.org>

Change so_cred's type to a ucred, not a pcred. THis makes more sense, actually.
Make a sonewconn3() which takes an extra argument (proc) so new sockets created
with sonewconn() from a user's system call get the correct credentials, not
just the parent's credentials.


# 9b8b58e0 30-Aug-1999 Jonathan Lemon <jlemon@FreeBSD.org>

Restructure TCP timeout handling:

- eliminate the fast/slow timeout lists for TCP and instead use a
callout entry for each timer.
- increase the TCP timer granularity to HZ
- implement "bad retransmit" recovery, as presented in
"On Estimating End-to-End Network Path Properties", by Allman and Paxson.

Submitted by: jlemon, wollmann


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 6da3d657 26-Aug-1999 Jonathan Lemon <jlemon@FreeBSD.org>

Add readonly OID ``net.inet.tcp.tcbhashsize'' so it is possible to
discover the size of the TCB hashtable on a running system.


# 490d50b6 11-Jul-1999 Brian Feldman <green@FreeBSD.org>

Two new sysctls: net.inet.tcp.getcred and net.inet.udp.getcred. These take
a sockaddr_in[2] (local, then remote) and return a struct ucred. Example
code for these is at:
http://www.FreeBSD.org/~green/inetd_ident.patch
http://www.FreeBSD.org/~green/freebsd4.c (for pidentd)

Reviewed by: bde


# 35ec852a 05-Jul-1999 Mike Smith <msmith@FreeBSD.org>

Use the new tunable macros for the net.inet.tcp.tcbhashsize tunable.


# 23fc6cdd 16-Jun-1999 Tor Egge <tegge@FreeBSD.org>

Close a race window where a tcp socket is closed while tcp_pcblist is
copying out tcp socket info, causing a NULL pointer to be dereferenced.


# 3d177f46 03-May-1999 Bill Fumerola <billf@FreeBSD.org>

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


# 75c13541 28-Apr-1999 Poul-Henning Kamp <phk@FreeBSD.org>

This Implements the mumbled about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more useable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no privisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/


# ea95d691 03-Feb-1999 Mike Smith <msmith@FreeBSD.org>

Nuke all the stupid ffs() stuff and use powerof2() instead.
Submitted by: Bruce Evans <bde@zeta.org.au>


# 609b6256 03-Feb-1999 Mike Smith <msmith@FreeBSD.org>

Fix power-of-2 check for the TCB hash size.

Submitted by: Brian Feldman <green@unixhelp.org>


# aa145849 03-Feb-1999 Mike Smith <msmith@FreeBSD.org>

Make TCBHASHSIZE a boot-time tunable as well, taking its value from the
variable net.inet.tcp.tcbhashsize.

Requested by: David Filo <filo@yahoo-inc.com>


# f1d19042 07-Dec-1998 Archie Cobbs <archie@FreeBSD.org>

The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.


# d285db55 15-Nov-1998 Guido van Rooij <guido@FreeBSD.org>

The below patch helps to reduce the leakage of internal socket information
when a TCP "stealth" scan is directed at a *BSD box by ensuring the window
is 0 for all RST packets generated through tcp_respond()
Reviewed by: Don Lewis <Don.Lewis@tsc.tdk.com>
Obtained from: Bugtraq (from: Darren Reed <avalon@COOMBS.ANU.EDU.AU>)


# 19ddafa3 06-Sep-1998 Poul-Henning Kamp <phk@FreeBSD.org>

RFC 1644 has the status "Experimental Protocol", which means:

4.1.4. Experimental Protocol

A system should not implement an experimental protocol unless it
is participating in the experiment and has coordinated its use of
the protocol with the developer of the protocol.

Pointed out by: Steinar Haug <sthaug@nethelp.no>


# 6effc713 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Re-implement tcp and ip fragment reassembly to not store pointers in the
ip header which can't work on alpha since pointers are too big.

Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>


# 98271db4 15-May-1998 Garrett Wollman <wollman@FreeBSD.org>

Convert socket structures to be type-stable and add a version number.

Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.

Convert PF_LOCAL PCB structures to be type-stable and add a version number.

Define an external format for infomation about socket structures and use
it in several places.

Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.

It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).


# 8781d8e9 28-Mar-1998 Bruce Evans <bde@FreeBSD.org>

Fixed style bugs (mostly) in previous commit.


# 3d4d47f3 24-Mar-1998 Garrett Wollman <wollman@FreeBSD.org>

Use the zone allocator to allocate inpcbs and tcpcbs. Each protocol creates
its own zone; this is used particularly by TCP which allocates both inpcb and
tcpcb in a single allocation. (Some hackery ensures that the tcpcb is
reasonably aligned.) Also keep track of the number of pcbs of each type
allocated, and keep a generation count (instance version number) for future
use.


# c3229e05 27-Jan-1998 David Greenman <dg@FreeBSD.org>

Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.

Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.

These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.

Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.

WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).


# 92252381 24-Jan-1998 Eivind Eklund <eivind@FreeBSD.org>

Make TCP_COMPAT_42 a new style option.


# 45d6875d 18-Dec-1997 Julian Elischer <julian@FreeBSD.org>

Fix an incredibly horrible bug in the ipfw code
where if you are using the "reset tcp" firewall command,
the kernel would write ethernet headers onto random kernel stack locations.

Fought to the death by: terry, julian, archie.
fix valid for 2.2 series as well.


# 55b211e3 28-Oct-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 0cc12cc5 16-Sep-1997 Joerg Wunsch <joerg@FreeBSD.org>

Make TCPDEBUG a new-style option.


# 514ede09 16-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Fixed gratuitous ANSIisms.


# ca98b82c 02-Apr-1997 David Greenman <dg@FreeBSD.org>

Reorganize elements of the inpcb struct to take better advantage of
cache lines. Removed the struct ip proto since only a couple of chars
were actually being used in it. Changed the order of compares in the
PCB hash lookup to take advantage of partial cache line fills (on PPro).

Discussed-with: wollman


# ddd79a97 03-Mar-1997 David Greenman <dg@FreeBSD.org>

Improved performance of hash algorithm while (hopefully) not reducing
the quality of the hash distribution. This does not fix a problem dealing
with poor distribution when using lots of IP aliases and listening
on the same port on every one of them...some other day perhaps; fixing
that requires significant code changes.
The use of xor was inspired by David S. Miller <davem@jenolan.rutgers.edu>


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# d0390e05 14-Feb-1997 Garrett Wollman <wollman@FreeBSD.org>

Fix the mechanism for choosing wehether to save the slow-start threshold
in the route. This allows us to remove the unconditional setting of the
pipesize in the route, which should mean that SO_SNDBUF and SO_RCVBUF
should actually work again. While we're at it:

- Convert udp_usrreq from `mondo switch statement from Hell' to new-style.
- Delete old TCP mondo switch statement from Hell, which had previously
been diked out.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 5e2d0696 24-Jul-1996 Garrett Wollman <wollman@FreeBSD.org>

Eliminate some more references to separate ip_v and ip_hl fields.


# 51fb3922 14-Jun-1996 Garrett Wollman <wollman@FreeBSD.org>

Better selection of initial retransmit timeout when no cached
RTT information is available.

Submitted by: kbracey@art.acorn.co.uk (Kevin Bracey)
(slightly modified by me)


# 6da5712b 05-Jun-1996 Garrett Wollman <wollman@FreeBSD.org>

Correct formula for TCP RTO calculation. Also try to do a better job in
filling in a new PCB's rttvar (but this is not the last word on the subject).
And get rid of `#ifdef RTV_RTT', it's been true for four years now...


# ebcae94e 27-Mar-1996 Garrett Wollman <wollman@FreeBSD.org>

In tcp_respond(), check that ro->ro_rt is non-null before RTFREEing
it.


# 9512fd2e 22-Mar-1996 Garrett Wollman <wollman@FreeBSD.org>

Make sure tcp_respond() always calls ip_output() with a valid
route pointer. This has no effect in the current ip_output(),
but my version requires that ip_output() always be passed a route.


# 2ee45d7d 11-Mar-1996 David Greenman <dg@FreeBSD.org>

Move or add #include <queue.h> in preparation for upcoming struct socket
changes.


# bda4c85a 20-Dec-1995 Garrett Wollman <wollman@FreeBSD.org>

Fix a nagging divide-by-zero error resulting from the MTU discovery code
getting triggered at a bad time.


# b62d102c 15-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Uniformized pr_ctlinput protosw functions. The third arg is now `void
*' instead of caddr_t and it isn't optional (it never was). Most of the
netipx (and netns) pr_ctlinput functions abuse the second arg instead of
using the third arg but fixing this is beyond the scope of this round
of changes.


# b7a44e34 05-Dec-1995 Garrett Wollman <wollman@FreeBSD.org>

Path MTU Discovery is now standard.


# 0312fbe9 14-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

New style sysctl & staticize alot of stuff.


# 98163b98 09-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Start adding new style sysctl here too.


# 3d1f141b 16-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

The ability to administratively change the MTU of an interface presents
a few new wrinkles for MTU discovery which tcp_output() had better
be prepared to handle. ip_output() is also modified to do something
helpful in this case, since it has already calculated the information
we need.


# 3abc79d2 12-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

The additional checks involving sequence numbers in MTU discovery resends
turned out not to be necessary; simply watching for MTU decreases (which
we already did) automagically eliminates all the cases we were trying to
protect against.


# 143d7a54 10-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

More MTU discovery: avoid over-retransmission if route changes in the
middle of a fully-open window. Also, keep track of how many retransmits
we do as a result of MTU discovery. This may actually do more work than
necessary, but it's an unusual condition...

Suggested by: Janey Hoe <janey@lcs.mit.edu>


# e79adb8e 03-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

Finish 4.4-Lite-2 merge: randomize TCP initial sequence numbers
to make ISS-guessing spoofing attacks harder.


# f001bbb8 22-Sep-1995 Garrett Wollman <wollman@FreeBSD.org>

Correct spelling error in MTUDISC code.


# efe4b0eb 21-Sep-1995 Garrett Wollman <wollman@FreeBSD.org>

Second try: get 4.4-Lite-2 into the source tree. The conflicts don't
matter because none of our working source files are on the CSRG branch
any more.

Obtained from: 4.4BSD-Lite-2


# f138387a 20-Sep-1995 Garrett Wollman <wollman@FreeBSD.org>

Add support in TCP for Path MTU discovery. This is highly experimental
and gated on `options MTUDISC' in the source. It is also practically
untested becausse (sniff!) I don't have easy access to a network with
an MTU of less than an Ethernet. If you have a small MTU network,
please try it and tell me if it works!


# 5cbf3e08 18-Sep-1995 Garrett Wollman <wollman@FreeBSD.org>

Initial back-end support for IP MTU discovery, gated on MTUDISC. The support
for TCP has yet to be written.


# fc978271 29-Jun-1995 Garrett Wollman <wollman@FreeBSD.org>

Keep track of the number of samples through the srtt filter so that we
know better when to cache values in the route, rather than relying on a
heuristic involving sequence numbers that broke when tcp_sendspace
was increased to 16k.


# 91677201 19-Jun-1995 Garrett Wollman <wollman@FreeBSD.org>

Now that we've gone to all sorts of effort to allow TCP to cache some of
its connection parameters, we want to keep statistics on how often this
actually happens to see whether there is any work that needs to be done in
TCP itself.

Suggested by: John Wroclawski <jtw@lcs.mit.edu>


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 15bd2b43 08-Apr-1995 David Greenman <dg@FreeBSD.org>

Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash,
and in_pcblookuphash.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 9617d8b1 05-Mar-1995 Nate Williams <nate@FreeBSD.org>

Removed unnecessary define for TCPOUTFLAGS since they are not used.


# 41f82abe 15-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Transaction TCP support now standard. Hack away!


# a0292f23 09-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Merge Transaction TCP, courtesy of Andras Olah <olah@cs.utwente.nl> and
Bob Braden <braden@isi.edu>.

NB: This has not had David's TCP ACK hack re-integrated. It is not clear
what the correct solution to this problem is, if any. If a better solution
doesn't pop up in response to this message, I'll put David's code back in
(or he's welcome to do so himself).


# ac0776ae 08-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics: silences gcc -Wall.


# 623ae52e 02-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

GCC cleanup.
Reviewed by:
Submitted by:
Obtained from:


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources