History log of /openbsd-current/sys/netinet/tcp_var.h
Revision Date Author Comments
# 1.178 13-May-2024 jsg

remove prototypes with no matching function
ok mpi@


# 1.177 12-Apr-2024 bluhm

Split single TCP inpcb table into IPv4 and IPv6 parts.

With two separate TCP hash tables, each one becomes smaller. When
we remove the exclusive net lock from TCP, contention on internet
PCB table mutex will be reduced. UDP has been split earlier into
IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with
assertions.

OK mvs@


Revision tags: OPENBSD_7_5_BASE
# 1.176 13-Feb-2024 bluhm

Merge struct route and struct route_in6.

Use a common struct route for both inet and inet6. Unfortunately
struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has
to be exposed from net/route.h. Struct route has to be bsd visible
for userland as netstat kvm code inspects inp_route. Internet PCB
and TCP SYN cache can use a plain struct route now. All specific
sockaddr types for inet and inet6 are embedded there.

OK claudio@


# 1.175 27-Jan-2024 bluhm

Declare address parameter in TCP SYN cache const.

tcp6_ctlinput() cast a constant sockaddr_in6 to non-const sockaddr.
sa6_src may be &sa6_any which lives in read-only data section.
Better pass down the const addresses to syn_cache_lookup(). They
are needed for hash lookup and are not modified.

OK mvs@


# 1.174 11-Jan-2024 bluhm

Fix white spaces in TCP.


# 1.173 29-Nov-2023 bluhm

Run TCP syn cache timer without kernel lock.

As syn_cache_timer() uses syn cache mutex and exclusive net lock,
it does not need kernel lock.

OK mvs@


# 1.172 16-Nov-2023 bluhm

Run TCP SYN cache timer logic without net lock.

Introduce global TCP SYN cache mutex. Divide timer function in
parts protected by mutex and sending with netlock. Split the flags
field in dynamic flags protected by mutex and fixed flags set during
initialization. Document whether fields of struct syn_cache are
protected by net lock or mutex.

input and OK sashan@


Revision tags: OPENBSD_7_4_BASE
# 1.171 04-Sep-2023 bluhm

Fix netstat output of uses of current SYN cache left.

TCP syn cache variable scs_use is basically counting packet insertions
into syn cache. Prefer type long to exclude overflow on fast
machines. Due to counting downwards from a limit, it can become
negative. Copy it out as tcps_sc_uses_left via sysctl, and print
it as signed long long integer.

OK mvs@


# 1.170 28-Aug-2023 bluhm

Introduce reference counting for TCP syn cache entries.

The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately
it has a race and panics sometimes with pool_do_get: syncache free
list modified. Add a reference counter for timeout and list of syn
cache entries. Currently list refcount is not strictly necessary
due to exclusive netlock, but will be needed when we continue
unlocking.

Checking timeout_initialized() is not MP friendly, better do proper
initialization during object allocation. Refcount in btrace helps
to find leaks.

bug reported and fix tested by Peter J. Philipp
OK claudio@


# 1.169 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
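
The arithmetic in this entry can be checked with a short sketch. The function names below are illustrative and are not taken from the kernel source; only the figures (49 days for 32 bits, 2.9*10^8 years for 63 bits, the lower 32 bits used for the timestamp option) come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define MS_PER_DAY (1000.0 * 60 * 60 * 24)

/* Days until a millisecond counter of the given bit width wraps. */
double
wrap_days(int bits)
{
	double ms = 1.0;
	int i;

	assert(bits > 0 && bits <= 64);
	for (i = 0; i < bits; i++)
		ms *= 2.0;
	return ms / MS_PER_DAY;
}

/* Hypothetical helper: the timestamp option carries the low 32 bits. */
uint32_t
ts_from_now(uint64_t now)
{
	return (uint32_t)now;
}

/* Elapsed milliseconds between two timestamps, computed modulo 2^32. */
uint32_t
ts_elapsed(uint32_t older, uint32_t newer)
{
	return newer - older;
}
```

Because the subtraction in ts_elapsed() is done in unsigned 32 bit arithmetic, a short interval is still measured correctly when the low 32 bits wrap between the two samples.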


# 1.168 02-Jul-2023 bluhm

Use TSO and LRO on the loopback interface to transfer TCP faster.

If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.

tested by jan@; OK claudio@ jan@


# 1.167 23-May-2023 jan

New counters for LRO packets from hardware TCP offloading.

With tweaks from patrick@ and bluhm@.

OK bluhm@


# 1.166 18-May-2023 jan

Use TSO offloading in ix(4).

With a lot of tweaks, improvements and testing from bluhm.

Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.

ok bluhm


# 1.165 15-May-2023 bluhm

Implement the TCP/IP layer for hardware TCP segmentation offload.
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@
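
The idea of precalculating the pseudo header checksum without the packet length can be sketched as follows. This is a simplified illustration, not the kernel code: the function names are made up, addresses are IPv4 values in host byte order, and the final one's complement inversion is left out. It relies only on the fact that one's complement addition is independent of the order of its terms, so the length can be added later.

```c
#include <assert.h>
#include <stdint.h>

/* Add a 16-bit word to a folded one's complement sum (end-around carry). */
uint32_t
csum_add(uint32_t sum, uint16_t word)
{
	assert(sum <= 0xffff);
	sum += word;
	return (sum & 0xffff) + (sum >> 16);
}

/* Partial pseudo header sum: addresses and protocol, but no length. */
uint16_t
pseudo_partial(uint32_t src, uint32_t dst, uint8_t proto)
{
	uint32_t sum = 0;

	sum = csum_add(sum, (uint16_t)(src >> 16));
	sum = csum_add(sum, (uint16_t)(src & 0xffff));
	sum = csum_add(sum, (uint16_t)(dst >> 16));
	sum = csum_add(sum, (uint16_t)(dst & 0xffff));
	sum = csum_add(sum, proto);
	return (uint16_t)sum;
}

/* Later step: add the TCP length once the size of a segment is known. */
uint16_t
pseudo_finish(uint16_t partial, uint16_t tcp_len)
{
	return (uint16_t)csum_add(partial, tcp_len);
}

/* Reference: the same sum with the length added first. */
uint16_t
pseudo_full(uint32_t src, uint32_t dst, uint8_t proto, uint16_t tcp_len)
{
	uint32_t sum = tcp_len;

	sum = csum_add(sum, (uint16_t)(src >> 16));
	sum = csum_add(sum, (uint16_t)(src & 0xffff));
	sum = csum_add(sum, (uint16_t)(dst >> 16));
	sum = csum_add(sum, (uint16_t)(dst & 0xffff));
	sum = csum_add(sum, proto);
	return (uint16_t)sum;
}
```

In this sketch pseudo_finish(pseudo_partial(src, dst, proto), len) and pseudo_full(src, dst, proto, len) give the same value, which is the property that lets the length be filled in per segment at a later stage.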


# 1.164 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@
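
As a rough illustration of the chopping step, the following sketch computes how many packets result when a large TCP payload is cut to the maximum segment size. The helper names and the example sizes are assumptions made for illustration; the actual kernel code operates on mbuf chains and is not shown here.

```c
#include <assert.h>

/* Number of packets produced when a payload is chopped to the MSS. */
unsigned int
tso_segments(unsigned int payload_len, unsigned int mss)
{
	assert(mss > 0);
	return (payload_len + mss - 1) / mss;
}

/* Payload length of segment idx (0-based); only the last one is short. */
unsigned int
tso_segment_len(unsigned int payload_len, unsigned int mss, unsigned int idx)
{
	unsigned int off = idx * mss;

	assert(mss > 0 && off < payload_len);
	return (payload_len - off < mss) ? payload_len - off : mss;
}
```

With a payload of 65000 bytes and an MSS of 1460, this gives 45 packets, the last one carrying 760 bytes. Layers above the chopping step handle one large packet instead of 45 small ones.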


Revision tags: OPENBSD_7_3_BASE
# 1.163 14-Mar-2023 yasuoka

To avoid misunderstanding, keep variables for tcp keepalive in
milliseconds, which is the same unit of tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.

ok claudio


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. The timestamp option is
disabled on some OSs (e.g. Windows) or dropped by some
firewalls/routers; in such cases the window size had been fixed at
16KB, which severely limits throughput on high latency networks.
Also change "tcp_now" from a 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio
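
Why a fixed 16KB window hurts on high latency links can be shown with a small calculation. The sketch below assumes the usual approximation that at most one receive window of data can be in flight per round trip; the function names and the 100 ms round trip time are examples chosen here, not values from the commit.

```c
#include <assert.h>

/* Upper bound on throughput (bytes/s) with one window in flight per RTT. */
double
window_limited_rate(double window_bytes, double rtt_ms)
{
	assert(rtt_ms > 0.0);
	return window_bytes * 1000.0 / rtt_ms;
}

/* Bandwidth-delay product: bytes in flight needed to fill the link. */
double
bdp_bytes(double bytes_per_sec, double rtt_ms)
{
	assert(rtt_ms >= 0.0);
	return bytes_per_sec * rtt_ms / 1000.0;
}
```

With a 16384 byte window and a 100 ms round trip, window_limited_rate() yields about 160 KB/s regardless of link speed. Filling a 100 Mbit/s link (12.5 MB/s) at the same round trip time would need roughly 1.25 MB in flight according to bdp_bytes().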


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest in the pru_abort() return value. We call it only from
soabort(), which is a dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, as it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
in the pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has the PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets return an EINVAL error for
the PRU_SOCKADDR request, so keep this behaviour for a while instead of
making the pru_sockaddr handler optional and returning EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support the PRU_ABORT request, but actually it
is required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with a separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
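
The sign bit problem can be illustrated with a small sketch. The constant values below are assumptions chosen so that the shift reaches bit 31; the authoritative definitions live in tcp_var.h and tcp_timer.h. With a signed int constant, shifting a set bit into position 31 is undefined behavior in C, which is the kind of issue kubsan reports. With an unsigned constant the result is well defined.

```c
#include <assert.h>

/* Assumed values for illustration only. */
#define TF_TIMER	0x04000000U	/* unsigned, as after this change */
#define TCPT_DELACK	5		/* assumed index of the delack timer */

/* Flag bit for a given timer index, computed in unsigned arithmetic. */
unsigned int
timer_flag(unsigned int timer)
{
	assert(timer <= TCPT_DELACK);
	return TF_TIMER << timer;
}
```

Under these assumed values, timer_flag(TCPT_DELACK) is 0x80000000U, which fits a u_int field such as t_flags without involving a signed shift.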


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
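
The flag-per-timer approach described above might look roughly like the following. This is a minimal user space sketch with hypothetical names (struct conn, timer_arm(), timer_fire()), not the kernel implementation, and it leaves out the locking. The point is that the handler rechecks the armed flag once it runs, so a timer canceled while the handler was waiting for the lock does nothing.

```c
#include <assert.h>

#define NTIMERS		4		/* hypothetical number of timers */
#define TMR_FLAG(t)	(1U << (t))	/* one "armed" bit per timer */

struct conn {
	unsigned int armed;		/* bitmask of armed timers */
	unsigned int fired[NTIMERS];	/* how often each timer did work */
};

void
timer_arm(struct conn *c, int t)
{
	assert(t >= 0 && t < NTIMERS);
	c->armed |= TMR_FLAG(t);
}

void
timer_cancel(struct conn *c, int t)
{
	assert(t >= 0 && t < NTIMERS);
	c->armed &= ~TMR_FLAG(t);
}

/* Timeout handler body; returns 1 if the timer was still armed. */
int
timer_fire(struct conn *c, int t)
{
	assert(t >= 0 && t < NTIMERS);
	if ((c->armed & TMR_FLAG(t)) == 0)
		return 0;		/* canceled in the meantime */
	c->armed &= ~TMR_FLAG(t);
	c->fired[t]++;
	return 1;
}
```

In this sketch, calling timer_cancel() between timer_arm() and timer_fire() makes timer_fire() return 0, even though the underlying timeout had already been scheduled to run.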


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
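
The active/passive scheme can be sketched as below. The structure and function names are hypothetical and the hash table itself is omitted; only the rule from the commit message is modeled: new entries go into the active set, and once the active set has used up its insertion budget (100000 in the commit) and the passive set has drained, the sets switch roles and the new active set receives a fresh seed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct syn_set {
	int count;		/* entries currently in this set */
	long use;		/* insertions left before a switch is due */
	uint32_t seed;		/* hash seed of this set */
};

struct syn_pair {
	struct syn_set set[2];
	int active;		/* index of the set taking new entries */
	long limit;		/* insertion budget, e.g. 100000 */
};

void
syn_pair_init(struct syn_pair *p, long limit, uint32_t seed)
{
	assert(limit > 0);
	memset(p, 0, sizeof(*p));
	p->limit = limit;
	p->set[0].use = p->set[1].use = limit;
	p->set[0].seed = seed;
}

/* Insert one entry; returns 1 if the sets switched roles first. */
int
syn_pair_insert(struct syn_pair *p, uint32_t new_seed)
{
	struct syn_set *act = &p->set[p->active];
	struct syn_set *pas = &p->set[!p->active];
	int switched = 0;

	if (act->use <= 0 && pas->count == 0) {
		pas->seed = new_seed;	/* reseed the empty set */
		pas->use = p->limit;
		p->active = !p->active;
		act = pas;
		switched = 1;
	}
	act->count++;
	act->use--;
	return switched;
}

/* Remove one entry from the given set, e.g. on timeout. */
void
syn_pair_remove(struct syn_pair *p, int idx)
{
	assert(idx == 0 || idx == 1);
	if (p->set[idx].count > 0)
		p->set[idx].count--;
}
```

An attacker who keeps the active set non-empty no longer blocks reseeding in this model, because the switch depends on the passive set draining, and the passive set receives no new entries.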


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
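
The fixed-point scaling mentioned in items 2 and 4 can be illustrated as follows. TCP_RTT_BASE_SHIFT and its value 2 come from the commit message; the values of TCP_RTT_SHIFT and TCP_RTTVAR_SHIFT are assumptions chosen so that the totals match the stated 5 and 4 fraction bits, and the helper functions are hypothetical.

```c
#include <assert.h>

#define TCP_RTT_BASE_SHIFT	2	/* named in the commit message */
#define TCP_RTT_SHIFT		3	/* assumed: 3 + 2 = 5 fraction bits */
#define TCP_RTTVAR_SHIFT	2	/* assumed: 2 + 2 = 4 fraction bits */

/* Convert whole ticks to the scaled srtt representation. */
int
srtt_from_ticks(int ticks)
{
	assert(ticks >= 0);
	return ticks << (TCP_RTT_SHIFT + TCP_RTT_BASE_SHIFT);
}

/* Convert a scaled srtt back to whole ticks, dropping the fraction. */
int
srtt_to_ticks(int srtt)
{
	assert(srtt >= 0);
	return srtt >> (TCP_RTT_SHIFT + TCP_RTT_BASE_SHIFT);
}

/* The same conversions for rttvar, which keeps one fraction bit less. */
int
rttvar_from_ticks(int ticks)
{
	assert(ticks >= 0);
	return ticks << (TCP_RTTVAR_SHIFT + TCP_RTT_BASE_SHIFT);
}

int
rttvar_to_ticks(int rttvar)
{
	assert(rttvar >= 0);
	return rttvar >> (TCP_RTTVAR_SHIFT + TCP_RTT_BASE_SHIFT);
}
```

A value of 3 ticks is stored as 96 in the srtt scale and as 48 in the rttvar scale under these assumptions, which is why a missing shift when mixing the two (item 3) gives wrong results.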


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun
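
The initial window rule from RFC 3390 is min(4*MSS, max(2*MSS, 4380 bytes)). A minimal sketch of that formula follows; `ex_initial_window` is a hypothetical name used for illustration, and this is not necessarily how the kernel code computes it.

```c
#include <assert.h>

/*
 * Illustrative only: upper bound on the initial window in bytes
 * according to RFC 3390, min(4*MSS, max(2*MSS, 4380)).
 */
unsigned int
ex_initial_window(unsigned int mss)
{
	unsigned int lo, hi;

	assert(mss > 0);
	lo = 2 * mss;
	if (lo < 4380)
		lo = 4380;		/* max(2*MSS, 4380) */
	hi = 4 * mss;
	return (hi < lo ? hi : lo);	/* min(4*MSS, ...) */
}
```

For a typical Ethernet MSS of 1460 bytes this gives 4380 bytes, i.e. three full segments.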


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


Revision tags: OPENBSD_7_4_BASE
# 1.171 04-Sep-2023 bluhm

Fix netstat output of uses of current SYN cache left.

TCP syn cache variable scs_use is basically counting packet insertions
into syn cache. Prefer type long to exclude overflow on fast
machines. Due to counting downwards from a limit, it can become
negative. Copy it out as tcps_sc_uses_left via sysctl, and print
it as signed long long integer.

OK mvs@


# 1.170 28-Aug-2023 bluhm

Introduce reference counting for TCP syn cache entries.

The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately
it has a race and panics sometimes with pool_do_get: syncache free
list modified. Add a reference counter for timeout and list of syn
cache entries. Currently list refcount is not strictly necessary
due to exclusive netlock, but will be needed when we continue
unlocking.

Checking timeout_initialized() is not MP friendly, better do proper
initialization during object allocation. Refcount in btrace helps
to find leaks.

bug reported and fix tested by Peter J. Philipp
OK claudio@


# 1.169 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
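
The arithmetic in the message above can be checked with a short sketch. The C below is illustrative only: `ex_wrap_days`, `ex_ts_val` and `ex_ts_elapsed` are hypothetical names, not functions from the OpenBSD tree, and the code assumes only what the message states (a millisecond counter, and a 32 bit timestamp option fed from the low bits of the 64 bit clock).

```c
#include <assert.h>
#include <stdint.h>

#define EX_MS_PER_DAY	(1000ULL * 60 * 60 * 24)

/* Whole days until a millisecond counter with `bits` bits wraps. */
uint64_t
ex_wrap_days(unsigned int bits)
{
	assert(bits >= 1 && bits <= 63);
	return ((1ULL << bits) / EX_MS_PER_DAY);
}

/* Low 32 bits of a 64 bit clock, as carried in the timestamp option. */
uint32_t
ex_ts_val(uint64_t now)
{
	return ((uint32_t)now);
}

/*
 * Elapsed milliseconds between two 32 bit timestamp values.
 * Unsigned subtraction is modulo 2^32, so the result stays correct
 * when the truncated value wraps between the two samples.
 */
uint32_t
ex_ts_elapsed(uint32_t later, uint32_t earlier)
{
	return ((uint32_t)(later - earlier));
}
```

With these definitions `ex_wrap_days(32)` is 49, matching the 49 days in the message, and `ex_wrap_days(63) / 365` is about 2.9*10^8 years.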


# 1.168 02-Jul-2023 bluhm

Use TSO and LRO on the loopback interface to transfer TCP faster.

If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.

tested by jan@; OK claudio@ jan@


# 1.167 23-May-2023 jan

New counters for LRO packets from hardware TCP offloading.

With tweaks from patrick@ and bluhm@.

OK bluhm@


# 1.166 18-May-2023 jan

Use TSO offloading in ix(4).

With a lot of tweaks, improvements and testing from bluhm.

Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.

ok bluhm


# 1.165 15-May-2023 bluhm

Implement the TCP/IP layer for hardware TCP segmentation offload.
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@


# 1.164 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.163 14-Mar-2023 yasuoka

To avoid misunderstanding, keep variables for tcp keepalive in
milliseconds, which is the same unit of tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.

ok claudio


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. The timestamp option is
disabled on some OSs (e.g. Windows) or dropped by some
firewalls/routers; in such cases the window size stayed fixed at
16KB, which limits throughput severely on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest in the pru_abort() return value. We call it only from
soabort() which is a dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement the (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
in the pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has the PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets return EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of making
the pru_sockaddr handler optional and returning EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT is used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot more specialized info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
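
A minimal sketch of the problem: left-shifting a signed int constant so that a one lands in the sign bit is undefined behavior in C, while the same shift on an unsigned constant is well defined. The constant values below are assumptions chosen for illustration (prefixed `EX_`), not copied from tcp_var.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values for illustration, not taken from tcp_var.h. */
#define EX_TF_TIMER	0x04000000U	/* flag bit of the first timer */
#define EX_TCPT_DELACK	5		/* index of the last timer */

/*
 * Flag bit for timer number `timer`.  Because EX_TF_TIMER carries the
 * U suffix, the shift is done in unsigned arithmetic and may set bit 31.
 * With a plain 0x04000000 (signed int) the same shift by 5 would
 * overflow into the sign bit.
 */
uint32_t
ex_timer_flag(unsigned int timer)
{
	assert(timer <= EX_TCPT_DELACK);
	return (EX_TF_TIMER << timer);
}
```

Under these assumed values, the flag for the last timer is 0x80000000, exactly the bit that a signed constant cannot reach safely.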


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent with the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
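
The active/passive scheme described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the kernel code: the `ex_` names and struct layout are hypothetical, entries are reduced to a counter, and the new seed is passed in by the caller where a kernel would draw fresh randomness.

```c
#include <assert.h>
#include <stdint.h>

#define EX_USE_LIMIT	100000	/* uses before a switch is allowed */

struct ex_set {
	int		count;		/* entries currently in this set */
	long		uses_left;	/* insertions left before switch */
	uint32_t	seed;		/* hash seed of this set */
};

struct ex_cache {
	struct ex_set	set[2];
	int		active;		/* index of the set taking inserts */
};

void
ex_cache_init(struct ex_cache *c, uint32_t seed)
{
	c->set[0].count = c->set[1].count = 0;
	c->set[0].uses_left = c->set[1].uses_left = EX_USE_LIMIT;
	c->set[0].seed = c->set[1].seed = seed;
	c->active = 0;
}

/*
 * Account for one insertion.  If the active set is used up and the
 * passive set has idled out, swap roles and reseed the new active set.
 * Returns 1 when a switch happened, 0 otherwise.
 */
int
ex_cache_insert(struct ex_cache *c, uint32_t new_seed)
{
	struct ex_set *act = &c->set[c->active];
	struct ex_set *pas = &c->set[!c->active];
	int switched = 0;

	if (act->uses_left <= 0 && pas->count == 0) {
		c->active = !c->active;
		act = pas;
		act->seed = new_seed;
		act->uses_left = EX_USE_LIMIT;
		switched = 1;
	}
	act->count++;
	act->uses_left--;
	return (switched);
}
```

The point of the design is that an attacker who keeps the old set non-empty can no longer block reseeding: new entries go to the other set, and the old one drains through timeouts.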


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
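
For readers unfamiliar with the idiom, here is a minimal sketch of why a
multi-statement macro is wrapped this way. CLAMP_NONNEG and
clamp_or_double() are hypothetical and are not macros or functions from
tcp_var.h.

```c
/* Hypothetical macro: clamp a value to zero if it is negative. */
#define CLAMP_NONNEG(v) do {		\
	if ((v) < 0)			\
		(v) = 0;		\
} while (/* CONSTCOND */ 0)

int
clamp_or_double(int x, int clamp)
{
	/*
	 * The macro plus the trailing semicolon form exactly one
	 * statement, so the else below still pairs with if (clamp).
	 * With a bare { } body the semicolon would end the if and
	 * the else would be a syntax error.
	 */
	if (clamp)
		CLAMP_NONNEG(x);
	else
		x *= 2;
	return x;
}
```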


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
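
As an illustration of the fraction bits mentioned in item 4, the scaling
can be pictured as plain fixed-point shifts. The macro and function names
below are hypothetical; only the bit counts (5 for t_srtt, 4 for t_rttvar)
come from the log message.

```c
#define SRTT_FRAC_BITS		5	/* fraction bits of t_srtt, per the log */
#define RTTVAR_FRAC_BITS	4	/* fraction bits of t_rttvar, per the log */

/* Convert a whole number of ticks to fixed point with `bits` fraction bits. */
int
to_fixed(int ticks, int bits)
{
	return ticks << bits;
}

/* Convert back to whole ticks, discarding the fraction. */
int
from_fixed(int value, int bits)
{
	return value >> bits;
}
```

With 5 fraction bits, 3 ticks are stored as 96, and a stored value of 100
reads back as 3 ticks.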


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.
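
A per-second limit of this kind can be sketched as below. This is an
assumption about the general technique, not the kernel's implementation:
pps_state and pps_check() are hypothetical names, the current time is
passed in as whole seconds, and this sketch allows at most `limit` packets
in any one second.

```c
/* Hypothetical rate limiter state. */
struct pps_state {
	long last_sec;	/* second in which the current window started */
	int  count;	/* packets allowed so far in that second */
};

/* Returns 1 if a packet may be sent at time now_sec, 0 if it should be dropped. */
int
pps_check(struct pps_state *st, long now_sec, int limit)
{
	if (now_sec != st->last_sec) {
		/* a new second has begun: start a fresh window */
		st->last_sec = now_sec;
		st->count = 0;
	}
	if (st->count >= limit)
		return 0;
	st->count++;
	return 1;
}
```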


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


Revision tags: OPENBSD_7_4_BASE
# 1.171 04-Sep-2023 bluhm

Fix netstat output of uses of current SYN cache left.

TCP syn cache variable scs_use is basically counting packet insertions
into syn cache. Prefer type long to exclude overflow on fast
machines. Due to counting downwards from a limit, it can become
negative. Copy it out as tcps_sc_uses_left via sysctl, and print
it as signed long long integer.

OK mvs@


# 1.170 28-Aug-2023 bluhm

Introduce reference counting for TCP syn cache entries.

The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately
it has a race and panics sometimes with pool_do_get: syncache free
list modified. Add a reference counter for timeout and list of syn
cache entries. Currently list refcount is not strictly necessary
due to exclusive netlock, but will be needed when we continue
unlocking.

Checking timeout_initialized() is not MP friendly, better do proper
initialization during object allocation. Refcount in btrace helps
to find leaks.

bug reported and fix tested by Peter J. Philipp
OK claudio@


# 1.169 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
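
The arithmetic in this message can be checked with a short sketch. The
function names are hypothetical; the figures are the ones quoted above
(about 49 days for 32 bits of milliseconds, about 2.9*10^8 years for 63
bits). A year is taken as 365 days here, which is an assumption.

```c
#include <stdint.h>

#define MS_PER_DAY	(24ULL * 60 * 60 * 1000)

/* Whole days until a 32 bit millisecond counter wraps around. */
uint64_t
wrap_days_32(void)
{
	return (1ULL << 32) / MS_PER_DAY;
}

/* Whole 365-day years covered by 2^63 milliseconds. */
uint64_t
span_years_63(void)
{
	return (1ULL << 63) / (MS_PER_DAY * 365);
}

/* The TCP timestamp option is 32 bit, so only the low half is used. */
uint32_t
ts_low32(uint64_t now_ms)
{
	return (uint32_t)now_ms;
}

/* Unsigned subtraction stays correct across a 32 bit wraparound. */
uint32_t
ts_elapsed(uint32_t later, uint32_t earlier)
{
	return later - earlier;
}
```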


# 1.168 02-Jul-2023 bluhm

Use TSO and LRO on the loopback interface to transfer TCP faster.

If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.

tested by jan@; OK claudio@ jan@


# 1.167 23-May-2023 jan

New counters for LRO packets from hardware TCP offloading.

With tweaks from patrick@ and bluhm@.

OK bluhm@


# 1.166 18-May-2023 jan

Use TSO offloading in ix(4).

With a lot of tweaks, improvements and testing from bluhm.

Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.

ok bluhm


# 1.165 15-May-2023 bluhm

Implement the TCP/IP layer for hardware TCP segmentation offload.
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@


# 1.164 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.163 14-Mar-2023 yasuoka

To avoid misunderstanding, keep variables for tcp keepalive in
milliseconds, which is the same unit of tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.

ok claudio


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, in such a case the window size had been fixed at
16KB, which limited throughput to a very low level on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio
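
To see why a fixed 16KB window hurts on high latency paths, note that a
sender can have at most one window of data in flight per round trip. The
helper below is a hypothetical back-of-the-envelope calculation, not kernel
code.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on throughput in bytes per second for a fixed window. */
uint64_t
max_throughput(uint64_t window_bytes, uint64_t rtt_ms)
{
	assert(rtt_ms > 0);	/* a zero RTT would divide by zero */
	return window_bytes * 1000 / rtt_ms;
}
```

With a 16KB window and a 200 ms round trip, max_throughput(16 * 1024, 200)
gives 81920 bytes per second, regardless of the link speed.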


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
on pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets return EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of making
pru_sockaddr handler optional and returning EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support the request, leave the handler NULL. Do
the NULL check within the corresponding pru_() wrapper and return EOPNOTSUPP in
such a case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
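
The NULL-handler convention described in 1.142 can be sketched as follows.
This is a minimal illustration, not the kernel code: the structure
pr_usrreqs_sketch, the wrapper pru_bind_sketch() and the handler demo_bind()
are hypothetical names, and the handler signature is simplified.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel socket structure. */
struct socket;

/* Hypothetical table of per-request handlers; NULL means "unsupported". */
struct pr_usrreqs_sketch {
    int (*pru_bind)(struct socket *, const void *);
};

/* Wrapper: do the NULL check once here instead of in every protocol. */
static int
pru_bind_sketch(const struct pr_usrreqs_sketch *ur, struct socket *so,
    const void *addr)
{
    if (ur->pru_bind == NULL)
        return (EOPNOTSUPP);
    return ((*ur->pru_bind)(so, addr));
}

/* Example handler for a protocol that does support the request. */
static int
demo_bind(struct socket *so, const void *addr)
{
    (void)so;
    (void)addr;
    return (0);
}
```

A protocol that leaves pru_bind NULL gets EOPNOTSUPP from the wrapper, so no
dummy handler is needed per protocol.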


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot more specialized info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
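
Why the unsigned constants in 1.137 matter can be shown with a small sketch.
The numeric values below are assumptions for illustration (check tcp_var.h
and tcp_timer.h for the real ones); the names with a _SKETCH suffix and
timer_flag() are hypothetical.

```c
#include <stdint.h>

/* Assumed values for illustration only. */
#define TF_TIMER_SKETCH     0x04000000U     /* unsigned constant */
#define TCPT_DELACK_SKETCH  5               /* assumed timer index */

/*
 * Flag bit for a given timer index. With a signed 0x04000000 the shift by 5
 * would move a one into the sign bit of int, which is undefined behavior in
 * C; with the U suffix the shift is done in unsigned arithmetic.
 */
static uint32_t
timer_flag(unsigned int timer)
{
    return (TF_TIMER_SKETCH << timer);
}
```

Under these assumed values, timer_flag(TCPT_DELACK_SKETCH) yields the
topmost bit of a 32 bit word, which is the case the sanitizer complained
about.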


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delayed ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
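
The armed/canceled flag from 1.130 can be sketched like this. It is a
simplified model, not the kernel implementation: tcb_sketch, TMR_FLAG() and
the three functions are hypothetical names, and locking is left out.

```c
#include <stdbool.h>

/* Hypothetical control block holding one "armed" bit per timer. */
struct tcb_sketch {
    unsigned int flags;
};

#define TMR_FLAG(t)     (0x1U << (t))

/* Arming sets the bit (the real code would also schedule the timeout). */
static void
timer_arm(struct tcb_sketch *tp, unsigned int t)
{
    tp->flags |= TMR_FLAG(t);
}

/* Canceling clears the bit; deleting the timeout alone is not enough. */
static void
timer_cancel(struct tcb_sketch *tp, unsigned int t)
{
    tp->flags &= ~TMR_FLAG(t);
}

/*
 * Timeout handler body, entered after the lock has been taken. If the timer
 * was canceled while the handler slept on the lock, do nothing.
 */
static bool
timer_run(struct tcb_sketch *tp, unsigned int t)
{
    if ((tp->flags & TMR_FLAG(t)) == 0)
        return (false);
    tp->flags &= ~TMR_FLAG(t);
    return (true);      /* caller performs the timer action */
}
```

The check in timer_run() is what makes cancelation reliable even when the
handler is already running and waiting for the lock.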


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as a soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that, when something is lost, treats all packets
not SACKed up to the most forward SACK as lost. This may be a correct
estimate if the network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random data.
tedu@ agrees; OK mpi@
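
The active/passive switching from 1.111 can be sketched as below. This is a
simplified model under stated assumptions: the structures, field names and
syn_insert_sketch() are hypothetical, the fresh seed is passed in by the
caller instead of coming from a random source, and entry removal is omitted.

```c
/* Hypothetical model of one syn cache set. */
struct syn_set_sketch {
    int count;          /* entries currently stored */
    long use;           /* insertions left before a switch is due */
    unsigned int seed;  /* hash seed */
};

/* Hypothetical model of the pair of sets. */
struct syn_cache_sketch {
    struct syn_set_sketch set[2];
    int active;         /* index of the set receiving new entries */
    long use_limit;     /* e.g. 100000 */
};

/* Account for one insertion; returns the index of the set that got it. */
static int
syn_insert_sketch(struct syn_cache_sketch *sc, unsigned int fresh_seed)
{
    struct syn_set_sketch *act = &sc->set[sc->active];
    struct syn_set_sketch *pas = &sc->set[!sc->active];

    if (act->use <= 0 && pas->count == 0) {
        /* Passive set has idled out: switch roles and reseed. */
        sc->active = !sc->active;
        act = pas;
        act->seed = fresh_seed;
        act->use = sc->use_limit;
    }
    act->count++;
    act->use--;
    return (sc->active);
}
```

In this model the use counter keeps counting down while the passive set
still holds entries, so an attacker who keeps one set busy can no longer
block reseeding of the other.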


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
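
The idiom named in 1.90 can be illustrated with a generic macro. SWAP_INT()
and max_first() are hypothetical examples, not macros from tcp_var.h.

```c
/* Multi-statement macro wrapped so that it behaves like one statement. */
#define SWAP_INT(a, b) do {             \
    int swap_tmp_ = (a);                \
    (a) = (b);                          \
    (b) = swap_tmp_;                    \
} while (/* CONSTCOND */ 0)

/* Put the larger value first; returns 1 if a swap happened. */
static int
max_first(int *x, int *y)
{
    if (*x < *y)
        SWAP_INT(*x, *y);   /* the trailing ';' does not break the else */
    else
        return (0);
    return (1);
}
```

With a plain { } block instead of do { } while (0), the semicolon after the
macro call would end the if statement and the following else would not
compile.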


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.175 27-Jan-2024 bluhm

Declare address parameter in TCP SYN cache const.

tcp6_ctlinput() casted a constant sockaddr_sin6 to non-const sockaddr.
sa6_src may be &sa6_any which lives in read-only data section.
Better pass down the const addresses to syn_cache_lookup(). They
are needed for hash lookup and are not modified.

OK mvs@


# 1.174 11-Jan-2024 bluhm

Fix white spaces in TCP.


# 1.173 29-Nov-2023 bluhm

Run TCP syn cache timer without kernel lock.

As syn_cache_timer() uses syn cache mutex and exclusive net lock,
it does not need kernel lock.

OK mvs@


# 1.172 16-Nov-2023 bluhm

Run TCP SYN cache timer logic without net lock.

Introduce global TCP SYN cache mutex. Divide timer function in
parts protected by mutex and sending with netlock. Split the flags
field in dynamic flags protected by mutex and fixed flags set during
initialization. Document whether fields of struct syn_cache are
protected by net lock or mutex.

input and OK sashan@


Revision tags: OPENBSD_7_4_BASE
# 1.171 04-Sep-2023 bluhm

Fix netstat output of uses of current SYN cache left.

TCP syn cache variable scs_use is basically counting packet insertions
into syn cache. Prefer type long to exclude overflow on fast
machines. Due to counting downwards from a limit, it can become
negative. Copy it out as tcps_sc_uses_left via sysctl, and print
it as signed long long integer.

OK mvs@


# 1.170 28-Aug-2023 bluhm

Introduce reference counting for TCP syn cache entries.

The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately
it has a race and panics sometimes with pool_do_get: syncache free
list modified. Add a reference counter for timeout and list of syn
cache entries. Currently list refcount is not strictly necessary
due to exclusive netlock, but will be needed when we continue
unlocking.

Checking timeout_initialized() is not MP friendly, better do proper
initialization during object allocation. Refcount in btrace helps
to find leaks.

bug reported and fix tested by Peter J. Philipp
OK claudio@


# 1.169 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
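
The wrap-around figures quoted in this entry can be checked with a few
lines of C. This is only an illustrative sketch, not code from the
commit; the ex_* helper names are hypothetical and exist just for this
example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: whole days until a millisecond counter of the
 * given width (1..63 bits) wraps around.
 */
static uint64_t
ex_wrap_days(unsigned int bits)
{
	uint64_t ms = UINT64_C(1) << bits;	/* number of distinct tick values */

	return ms / 1000 / 86400;		/* ms -> seconds -> days */
}

/*
 * The TCP timestamp option is 32 bits wide, so only the low 32 bits
 * of a 64 bit counter fit into it.
 */
static uint32_t
ex_ts_low32(uint64_t now)
{
	return (uint32_t)now;
}

/*
 * Elapsed milliseconds between two 32 bit timestamps.  Unsigned
 * subtraction stays correct across a single wrap of the counter.
 */
static uint32_t
ex_ts_elapsed(uint32_t later, uint32_t earlier)
{
	return later - earlier;
}
```

With these helpers, ex_wrap_days(32) yields 49 whole days, and
ex_wrap_days(63) / 365 is roughly 2.9*10^8 years, in line with the
numbers above.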


# 1.168 02-Jul-2023 bluhm

Use TSO and LRO on the loopback interface to transfer TCP faster.

If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.

tested by jan@; OK claudio@ jan@


# 1.167 23-May-2023 jan

New counters for LRO packets from hardware TCP offloading.

With tweaks from patrick@ and bluhm@.

OK bluhm@


# 1.166 18-May-2023 jan

Use TSO offloading in ix(4).

With a lot of tweaks, improvements and testing from bluhm.

Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.

ok bluhm


# 1.165 15-May-2023 bluhm

Implement the TCP/IP layer for hardware TCP segmentation offload.
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@


# 1.164 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.163 14-Mar-2023 yasuoka

To avoid misunderstanding, keep variables for tcp keepalive in
milliseconds, which is the same unit of tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.

ok claudio


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. The timestamp option is
disabled on some OSs (e.g. Windows) or dropped by some
firewalls/routers; in such cases the window size had been fixed at
16KB, which limits throughput to very low values on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest in the pru_abort() return value. We call it only from
soabort(), which is a dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
in the pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has the PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets return EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of making
pru_sockaddr handler optional and returning EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot more specialized info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
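
The sign-bit problem described in this entry can be illustrated as
follows. This is a sketch under assumptions: the EX_ constants are
example values chosen so that the shift reaches bit 31, and they are
not claimed to match the header's actual definitions.

```c
/*
 * Assumed example values: a base timer flag at bit 26 and a timer
 * index of 5, so that the shifted flag lands on bit 31.
 */
#define EX_TF_TIMER	0x04000000U	/* unsigned constant, note the U suffix */
#define EX_TCPT_DELACK	5

/*
 * Compute the per-timer flag bit.  Because EX_TF_TIMER is unsigned,
 * shifting into bit 31 is well defined.  With a plain int constant
 * (0x04000000 without the U) the same shift would overflow a 32 bit
 * signed int, which is undefined behavior in C.
 */
static unsigned int
ex_timer_flag(int timer)
{
	return EX_TF_TIMER << timer;
}
```

Here ex_timer_flag(EX_TCPT_DELACK) evaluates to 0x80000000U, which fits
an u_int flags field without touching a sign bit.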


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
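
The role switch between the active and the passive syn cache might be
sketched like this. It is a simplified model and not the kernel code;
all ex_* names, the structure layout, and the idea of passing the new
seed as a parameter are assumptions made for this example.

```c
#define EX_USE_LIMIT	100000		/* uses before a switch is wanted */

/* Hypothetical, simplified syn cache set. */
struct ex_set {
	int		count;		/* entries currently in the set */
	long		use;		/* remaining uses, counts downwards */
	unsigned int	seed;		/* hash seed for bucket selection */
};

static struct ex_set ex_sets[2];
static int ex_active;			/* index of the active set */

static void
ex_init(unsigned int seed)
{
	ex_sets[0] = (struct ex_set){ 0, EX_USE_LIMIT, seed };
	ex_sets[1] = (struct ex_set){ 0, EX_USE_LIMIT, seed };
	ex_active = 0;
}

/*
 * Insert one entry.  New entries always go into the active set.  When
 * the passive set has idled out and the active one is used up, the
 * sets switch roles and the new active set gets a fresh seed.
 * Returns 1 if a switch happened, 0 otherwise.
 */
static int
ex_insert(unsigned int new_seed)
{
	struct ex_set *act = &ex_sets[ex_active];
	struct ex_set *pas = &ex_sets[!ex_active];
	int switched = 0;

	if (pas->count == 0 && act->use <= 0) {
		ex_active = !ex_active;
		act = pas;		/* the empty set becomes active */
		act->seed = new_seed;
		act->use = EX_USE_LIMIT;
		switched = 1;
	}
	act->count++;
	act->use--;
	return switched;
}
```

In this model an attacker who keeps a few entries alive in one set can
no longer block reseeding, because that set simply becomes passive and
drains while new entries use the other one.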


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Test by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
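
For readers unfamiliar with the idiom, a minimal sketch follows. The
macro and function are hypothetical examples, not taken from the
header.

```c
/*
 * Hypothetical multi-statement macro.  Wrapping the body in
 * do { } while (0) turns it into a single statement that still needs
 * a trailing semicolon, so it behaves like a function call inside
 * if/else.  The CONSTCOND comment tells lint that the constant
 * condition is intentional.
 */
#define EX_SWAP_INT(a, b) do {						\
	int ex_tmp_ = (a);						\
	(a) = (b);							\
	(b) = ex_tmp_;							\
} while (/* CONSTCOND */ 0)

/*
 * Put two values in ascending order.  Without the idiom, a bare
 * { ... } block followed by ';' would end the if statement early and
 * leave the else branch dangling, which is a syntax error.
 * Returns 1 if the values were swapped, 0 if already ordered.
 */
static int
ex_sort2(int *lo, int *hi)
{
	if (*lo > *hi)
		EX_SWAP_INT(*lo, *hi);
	else
		return 0;
	return 1;
}
```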


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.
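
The bitmask idea can be sketched as follows. This is an illustration
only, not the kernel's code: the names ex_portmask, ex_port_set and
ex_port_isset are hypothetical, and covering the full 16-bit port
range is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_NPORTS	65536	/* assumption: one bit per 16-bit port */

struct ex_portmask {
	uint32_t bits[EX_NPORTS / 32];
};

/* Mark a port as excluded from dynamic allocation. */
static void
ex_port_set(struct ex_portmask *m, uint16_t port)
{
	assert(m != NULL);
	m->bits[port / 32] |= (uint32_t)1 << (port % 32);
}

/* Return 1 if the port is marked, 0 otherwise. */
static int
ex_port_isset(const struct ex_portmask *m, uint16_t port)
{
	assert(m != NULL);
	return (int)((m->bits[port / 32] >> (port % 32)) & 1);
}

/* Mark one port in an empty mask, then probe another port. */
static int
ex_demo_port(uint16_t marked, uint16_t probe)
{
	struct ex_portmask m;

	memset(&m, 0, sizeof(m));
	ex_port_set(&m, marked);
	return ex_port_isset(&m, probe);
}
```

A lookup is a single shift and mask, which is why a bitmask suits a
check that runs on every dynamic port allocation.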


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.174 11-Jan-2024 bluhm

Fix white spaces in TCP.


# 1.173 29-Nov-2023 bluhm

Run TCP syn cache timer without kernel lock.

As syn_cache_timer() uses syn cache mutex and exclusive net lock,
it does not need kernel lock.

OK mvs@


# 1.172 16-Nov-2023 bluhm

Run TCP SYN cache timer logic without net lock.

Introduce global TCP SYN cache mutex. Divide timer function in
parts protected by mutex and sending with netlock. Split the flags
field in dynamic flags protected by mutex and fixed flags set during
initialization. Document whether fields of struct syn_cache are
protected by net lock or mutex.

input and OK sashan@


Revision tags: OPENBSD_7_4_BASE
# 1.171 04-Sep-2023 bluhm

Fix netstat output of uses of current SYN cache left.

TCP syn cache variable scs_use is basically counting packet insertions
into syn cache. Prefer type long to exclude overflow on fast
machines. Due to counting downwards from a limit, it can become
negative. Copy it out as tcps_sc_uses_left via sysctl, and print
it as signed long long integer.

OK mvs@


# 1.170 28-Aug-2023 bluhm

Introduce reference counting for TCP syn cache entries.

The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately
it has a race and panics sometimes with pool_do_get: syncache free
list modified. Add a reference counter for timeout and list of syn
cache entries. Currently list refcount is not strictly necessary
due to exclusive netlock, but will be needed when we continue
unlocking.

Checking timeout_initialized() is not MP friendly, better do proper
initialization during object allocation. Refcount in btrace helps
to find leaks.

bug reported and fix tested by Peter J. Philipp
OK claudio@
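
A minimal sketch of the reference counting described above, assuming
one reference for the list and one for the pending timeout. All names
are hypothetical, and the counter is a plain integer here, whereas
kernel code would need an atomic counter.

```c
#include <assert.h>
#include <stdlib.h>

struct ex_syn_entry {
	unsigned int refcnt;	/* one per holder: list, timeout */
};

static int ex_freed;		/* counts frees, for illustration only */

/* Allocate an entry; the caller owns the first reference. */
static struct ex_syn_entry *
ex_entry_alloc(void)
{
	struct ex_syn_entry *e = calloc(1, sizeof(*e));

	if (e != NULL)
		e->refcnt = 1;
	return e;
}

/* Take an additional reference, e.g. when arming the timeout. */
static void
ex_entry_take(struct ex_syn_entry *e)
{
	assert(e->refcnt > 0);
	e->refcnt++;
}

/* Drop a reference; the last holder frees the entry. */
static void
ex_entry_rele(struct ex_syn_entry *e)
{
	assert(e->refcnt > 0);
	if (--e->refcnt == 0) {
		free(e);
		ex_freed++;
	}
}

/* Hold list + timeout references, drop some, report frees so far. */
static int
ex_demo_freed_after(int drops)
{
	struct ex_syn_entry *e = ex_entry_alloc();
	int before = ex_freed, result, i;

	if (e == NULL)
		return -1;
	assert(drops >= 0 && drops <= 2);
	ex_entry_take(e);			/* timeout reference */
	for (i = 0; i < drops; i++)
		ex_entry_rele(e);
	result = ex_freed - before;
	for (i = drops; i < 2; i++)		/* clean up the rest */
		ex_entry_rele(e);
	return result;
}
```

The entry is freed exactly once, by whichever holder drops the last
reference, so neither the list nor the timeout can free memory the
other still uses.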


# 1.169 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
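
The arithmetic in this message can be checked with a short sketch. The
helper names are hypothetical and this is not the kernel's tcp_now()
code.

```c
#include <assert.h>
#include <stdint.h>

#define EX_MS_PER_DAY	(1000ULL * 60 * 60 * 24)

/* Whole days until a millisecond counter of the given width wraps. */
static uint64_t
ex_wrap_days(unsigned int bits)
{
	assert(bits > 0 && bits < 64);
	return (1ULL << bits) / EX_MS_PER_DAY;
}

/* The timestamp option carries only the low 32 bits of the counter. */
static uint32_t
ex_ts_val(uint64_t now)
{
	return (uint32_t)now;
}

/*
 * Elapsed milliseconds between two 32-bit timestamps.  Unsigned
 * subtraction is modulo 2^32, so the result stays correct across
 * a single wrap of the 32-bit value.
 */
static uint32_t
ex_ts_elapsed(uint32_t newer, uint32_t older)
{
	return newer - older;
}
```

ex_wrap_days(32) gives the 49 days mentioned above, and
ex_wrap_days(63) / 365 comes to roughly 2.9*10^8 years.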


# 1.168 02-Jul-2023 bluhm

Use TSO and LRO on the loopback interface to transfer TCP faster.

If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.

tested by jan@; OK claudio@ jan@


# 1.167 23-May-2023 jan

New counters for LRO packets from hardware TCP offloading.

With tweaks from patrick@ and bluhm@.

OK bluhm@


# 1.166 18-May-2023 jan

Use TSO offloading in ix(4).

With a lot of tweaks, improvements and testing from bluhm.

Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.

ok bluhm


# 1.165 15-May-2023 bluhm

Implement the TCP/IP layer for hardware TCP segmentation offload.
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@


# 1.164 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.163 14-Mar-2023 yasuoka

To avoid misunderstanding, keep variables for tcp keepalive in
milliseconds, which is the same unit of tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.

ok claudio


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, in such a case the window size had been fixed at
16KB, which kept throughput very low on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio
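
Why a fixed 16KB window hurts on high latency links: a sender can have
at most one window of data in flight per round trip. A small sketch of
that bound, with a hypothetical helper name and example RTT values
that are assumptions, not measurements:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound on throughput in bytes per second when at most
 * window_bytes may be in flight per round trip of rtt_ms.
 */
static uint64_t
ex_max_throughput(uint32_t window_bytes, uint32_t rtt_ms)
{
	assert(rtt_ms > 0);
	return (uint64_t)window_bytes * 1000 / rtt_ms;
}
```

With a 16384 byte window, a 100 ms round trip caps throughput at
163840 bytes per second regardless of link speed, and doubling the RTT
halves that bound.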


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others; it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
on pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol have PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets returns EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of make
pru_sockaddr handler optional and return EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
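
The wrapper pattern described above, sketched with hypothetical types.
The ex_ structures are stand-ins for illustration, not the kernel's
struct socket or struct pr_usrreqs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ex_socket;

/* Hypothetical per-protocol handler table; NULL means unsupported. */
struct ex_usrreqs {
	int	(*pru_bind)(struct ex_socket *);
};

struct ex_socket {
	const struct ex_usrreqs	*so_usrreqs;
};

/* Wrapper: the only place that has to know about missing handlers. */
static int
ex_pru_bind(struct ex_socket *so)
{
	assert(so != NULL && so->so_usrreqs != NULL);
	if (so->so_usrreqs->pru_bind == NULL)
		return EOPNOTSUPP;
	return (*so->so_usrreqs->pru_bind)(so);
}

/* Stand-in handler for a protocol that supports bind. */
static int
ex_proto_bind(struct ex_socket *so)
{
	(void)so;
	return 0;
}

/* Call the wrapper on a socket whose protocol does or does not bind. */
static int
ex_demo_bind(int supported)
{
	const struct ex_usrreqs with = { ex_proto_bind };
	const struct ex_usrreqs without = { NULL };
	struct ex_socket so = { supported ? &with : &without };

	return ex_pru_bind(&so);
}
```

Keeping the NULL check in the wrapper means protocols without the
request need no dummy handler at all.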


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
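
A sketch of the sign-bit problem. The constant values here are
assumptions chosen so that the shift reaches bit 31; they are not
quoted from the header.

```c
#include <assert.h>

/*
 * Assumed values for illustration: a timer base flag at bit 26 and
 * a timer index of 5.  With a signed constant, 0x04000000 << 5 would
 * shift a one into the sign bit of int, which is undefined behavior.
 * The U suffix makes the whole expression unsigned.
 */
#define EX_TF_TIMER	0x04000000U
#define EX_TCPT_DELACK	5

/* Flag bit for the given timer index. */
static unsigned int
ex_timer_flag(unsigned int timer)
{
	assert(timer <= EX_TCPT_DELACK);
	return EX_TF_TIMER << timer;
}
```

Because the result is stored in an unsigned flags word anyway, the
unsigned constant matches the type it ends up in.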


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timer.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
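
The per-timer flag can be sketched like this. The names and the flag
layout are hypothetical; the point is only that the handler checks the
flag itself instead of trusting that a deleted timeout never runs.

```c
#include <assert.h>

#define EX_TF_TIMER	0x1U	/* assumption: base bit for timer flags */

struct ex_tcpcb {
	unsigned int t_flags;
};

/* Arm: remember that this timer is wanted. */
static void
ex_timer_arm(struct ex_tcpcb *tp, int t)
{
	assert(t >= 0 && t < 32);
	tp->t_flags |= EX_TF_TIMER << t;
}

/* Cancel: clear the flag; a handler already running will see it. */
static void
ex_timer_cancel(struct ex_tcpcb *tp, int t)
{
	assert(t >= 0 && t < 32);
	tp->t_flags &= ~(EX_TF_TIMER << t);
}

/* Handler: returns 1 if it did its work, 0 if it was canceled. */
static int
ex_timer_fire(struct ex_tcpcb *tp, int t)
{
	assert(t >= 0 && t < 32);
	if ((tp->t_flags & (EX_TF_TIMER << t)) == 0)
		return 0;
	tp->t_flags &= ~(EX_TF_TIMER << t);
	return 1;
}

/* Arm timer 0, optionally cancel it, then let the handler run. */
static int
ex_demo_fire(int cancel_first)
{
	struct ex_tcpcb tp = { 0 };

	ex_timer_arm(&tp, 0);
	if (cancel_first)
		ex_timer_cancel(&tp, 0);
	return ex_timer_fire(&tp, 0);
}
```

In the kernel the flag would be tested after the handler has obtained
the net lock, which is exactly the window the message describes.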


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that require a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
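
For readers unfamiliar with the idiom, a minimal sketch follows. The
macro and function names are made up for illustration and do not come
from tcp_var.h.

```c
static int calls;

static void
step(void)
{
    calls++;
}

/*
 * A multi-statement macro written as plain "step(); step()" would break
 * inside an unbraced if/else: only the first statement is governed by
 * the if, and the following else no longer parses.
 *
 * Wrapping the body in do { } while (0) turns it into exactly one
 * statement that still needs its trailing semicolon at the call site.
 */
#define TWO_STEPS() do {            \
    step();                         \
    step();                         \
} while (/* CONSTCOND */ 0)

static int
run(int cond)
{
    calls = 0;
    if (cond)
        TWO_STEPS();    /* behaves like a single statement */
    else
        step();
    return calls;
}
```

The CONSTCOND comment is a lint hint that the constant condition is
intentional.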


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
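
Sequence numbers wrap at 2^32, so they are compared modulo 2^32 rather
than with plain < and >. The sketch below shows the usual signed
difference trick and a min/max built on it. The lower case names and
the exact form of the ACK check are assumptions for illustration, not
the macros or the test used in the tree.

```c
#include <stdint.h>

typedef uint32_t seq_t;     /* hypothetical stand-in for a TCP sequence */

/* a is before b if the wrapped difference is negative as a signed value */
static int
seq_lt(seq_t a, seq_t b)
{
    return (int32_t)(a - b) < 0;
}

static int
seq_leq(seq_t a, seq_t b)
{
    return (int32_t)(a - b) <= 0;
}

static seq_t
seq_min(seq_t a, seq_t b)
{
    return seq_lt(a, b) ? a : b;
}

static seq_t
seq_max(seq_t a, seq_t b)
{
    return seq_lt(a, b) ? b : a;
}

/* One plausible sanity check: the ACK must lie in [snd_una, snd_max]. */
static int
ack_plausible(seq_t ack, seq_t snd_una, seq_t snd_max)
{
    return seq_leq(snd_una, ack) && seq_leq(ack, snd_max);
}
```

The comparison stays correct across the wrap: 0xfffffff0 counts as
earlier than 0x10.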


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
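
The "bits of fraction" in item 4 refer to fixed-point storage: the
value is kept multiplied by 2^bits so that fractions of a tick survive
integer arithmetic. A minimal sketch, assuming non-negative tick counts
and the textbook 1/8 smoothing gain; the names are hypothetical and
this is not the tcp_xmit_timer() code.

```c
#include <stdint.h>

#define SRTT_FRAC_BITS      5   /* width taken from the commit message */
#define RTTVAR_FRAC_BITS    4   /* width taken from the commit message */

/* Convert whole ticks to fixed point with the given fraction width. */
static int32_t
to_fixed(int32_t ticks, int bits)
{
    return ticks << bits;
}

/* Drop the fraction again; every use of the stored value needs this. */
static int32_t
from_fixed(int32_t value, int bits)
{
    return value >> bits;
}

/* srtt += (sample - srtt) / 8, carried out entirely in fixed point. */
static int32_t
srtt_update(int32_t srtt_fixed, int32_t sample_ticks)
{
    int32_t delta = to_fixed(sample_ticks, SRTT_FRAC_BITS) - srtt_fixed;

    return srtt_fixed + delta / 8;
}
```

Forgetting the shift on one side of a comparison is the kind of bug
item 3 describes: a fixed-point 96 means 3 ticks, not 96.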


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoid traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.
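
A per-second limiter of this kind can be sketched as a counter that
resets whenever the clock moves to a new second. The structure and
function names are hypothetical, and "at most limit events per second"
is an assumption about the exact boundary.

```c
#include <stdint.h>

/* Hypothetical limiter state: current one-second window and its count. */
struct pps_state {
    int64_t window_start;   /* second in which counting started */
    int     count;          /* events allowed in that second so far */
};

/* Return 1 if the event may go out, 0 if it should be suppressed. */
static int
pps_allow(struct pps_state *st, int64_t now_sec, int limit)
{
    if (now_sec != st->window_start) {
        /* a new second has begun, start counting from zero */
        st->window_start = now_sec;
        st->count = 0;
    }
    if (st->count >= limit)
        return 0;
    st->count++;
    return 1;
}
```

A caller would check pps_allow() before sending each RST and silently
drop the response when it returns 0.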


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.
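
The idea can be sketched as follows, with made-up type names rather
than the structures from the tree. The header stand-in contains no
pointers, so its size does not depend on the pointer width; the list
linkage lives in a separate queue element that points at the header.

```c
#include <stddef.h>

/* Stand-in for a wire-format header: fixed layout, no pointers inside. */
struct wire_hdr {
    unsigned short id;
    unsigned short off;
};

/* Queue element: carries the linkage and points at the header. */
struct qent {
    struct qent     *next;
    struct wire_hdr *hdr;
};

/* Link a header into the queue through its queue element. */
static void
q_push(struct qent **head, struct qent *e, struct wire_hdr *h)
{
    e->hdr = h;
    e->next = *head;
    *head = e;
}

static size_t
q_len(const struct qent *head)
{
    size_t n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

Only struct qent grows on a 64-bit machine; the header overlay keeps
the same layout everywhere.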


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.173 29-Nov-2023 bluhm

Run TCP syn cache timer without kernel lock.

As syn_cache_timer() uses syn cache mutex and exclusive net lock,
it does not need kernel lock.

OK mvs@


# 1.172 16-Nov-2023 bluhm

Run TCP SYN cache timer logic without net lock.

Introduce global TCP SYN cache mutex. Divide timer function in
parts protected by mutex and sending with netlock. Split the flags
field in dynamic flags protected by mutex and fixed flags set during
initialization. Document whether fields of struct syn_cache are
protected by net lock or mutex.

input and OK sashan@


Revision tags: OPENBSD_7_4_BASE
# 1.171 04-Sep-2023 bluhm

Fix netstat output of uses of current SYN cache left.

TCP syn cache variable scs_use is basically counting packet insertions
into syn cache. Prefer type long to exclude overflow on fast
machines. Due to counting downwards from a limit, it can become
negative. Copy it out as tcps_sc_uses_left via sysctl, and print
it as signed long long integer.

OK mvs@


# 1.170 28-Aug-2023 bluhm

Introduce reference counting for TCP syn cache entries.

The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately
it has a race and panics sometimes with pool_do_get: syncache free
list modified. Add a reference counter for timeout and list of syn
cache entries. Currently list refcount is not strictly necessary
due to exclusive netlock, but will be needed when we continue
unlocking.

Checking timeout_initialized() is not MP friendly, better do proper
initialization during object allocation. Refcount in btrace helps
to find leaks.

bug reported and fix tested by Peter J. Philipp
OK claudio@
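
The general reference counting pattern, one reference per holder with
the object freed on the last release, looks roughly like this in
portable C11. It is a userland sketch with hypothetical names; the
kernel uses its own primitives, which are not shown here.

```c
#include <stdatomic.h>
#include <stdlib.h>

static int entries_freed;       /* bookkeeping for the example only */

struct sc_entry {
    atomic_uint refcnt;
    /* payload omitted */
};

/* A new entry starts with one reference, owned by the creator. */
static struct sc_entry *
entry_alloc(void)
{
    struct sc_entry *e = malloc(sizeof(*e));

    if (e != NULL)
        atomic_init(&e->refcnt, 1);
    return e;
}

/* Each additional holder (a list, a pending timeout) takes a reference. */
static struct sc_entry *
entry_take(struct sc_entry *e)
{
    atomic_fetch_add(&e->refcnt, 1);
    return e;
}

/* The holder that drops the last reference frees the entry. */
static void
entry_rele(struct sc_entry *e)
{
    if (atomic_fetch_sub(&e->refcnt, 1) == 1) {
        free(e);
        entries_freed++;
    }
}
```

With one reference for the list and one for the timeout, neither side
can free the entry while the other may still touch it.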


# 1.169 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
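
The figures quoted above can be checked with a few lines of C, which
also show how the low 32 bits serve as the timestamp option value. The
function names are hypothetical and a year is taken as 365 days.

```c
#include <stdint.h>

#define MS_PER_DAY  (1000ULL * 60 * 60 * 24)

/* Whole days until a 32 bit millisecond counter wraps. */
static uint64_t
wrap_days_32(void)
{
    return (1ULL << 32) / MS_PER_DAY;
}

/* Whole years covered by 2^63 milliseconds. */
static uint64_t
wrap_years_63(void)
{
    return (1ULL << 63) / (MS_PER_DAY * 365);
}

/* The timestamp option is 32 bits wide: keep only the lower half. */
static uint32_t
ts_val(uint64_t now_ms)
{
    return (uint32_t)now_ms;
}

/* Unsigned subtraction gives the elapsed time even across a wrap. */
static uint32_t
ts_elapsed(uint32_t later, uint32_t earlier)
{
    return later - earlier;
}
```

wrap_days_32() yields 49, matching the 49 days in the message, and
wrap_years_63() is on the order of 2.9 * 10^8.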


# 1.168 02-Jul-2023 bluhm

Use TSO and LRO on the loopback interface to transfer TCP faster.

If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.

tested by jan@; OK claudio@ jan@


# 1.167 23-May-2023 jan

New counters for LRO packets from hardware TCP offloading.

With tweaks from patrick@ and bluhm@.

OK bluhm@


# 1.166 18-May-2023 jan

Use TSO offloading in ix(4).

With a lot of tweaks, improvements and testing from bluhm.

Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.

ok bluhm


# 1.165 15-May-2023 bluhm

Implement the TCP/IP layer for hardware TCP segmentation offload.
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@


# 1.164 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.163 14-Mar-2023 yasuoka

To avoid misunderstanding, keep variables for tcp keepalive in
milliseconds, which is the same unit of tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.

ok claudio


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (e.g. Windows) or dropped by some
firewalls/routers, the window size in such cases had been fixed at
16KB, which limits throughput to a very low rate on high-latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
on pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has the PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets returns EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of make
pru_sockaddr handler optional and return EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
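
The optional handler pattern described above can be sketched like this.
All type and function names are hypothetical stand-ins, not the
pr_usrreqs definitions from the tree.

```c
#include <errno.h>
#include <stddef.h>

struct sock;

/* Hypothetical handler table shared by all sockets of one protocol. */
struct usrreqs {
    int (*bind)(struct sock *, const char *addr);   /* may be NULL */
};

struct sock {
    const struct usrreqs *reqs;
    int bound;
};

/* Wrapper: the NULL check lives here, not in every protocol. */
static int
do_bind(struct sock *so, const char *addr)
{
    if (so->reqs->bind == NULL)
        return EOPNOTSUPP;
    return so->reqs->bind(so, addr);
}

/* A protocol that supports the request supplies a handler. */
static int
stream_bind(struct sock *so, const char *addr)
{
    (void)addr;
    so->bound = 1;
    return 0;
}

static const struct usrreqs stream_reqs = { .bind = stream_bind };

/* A protocol without support simply leaves the slot NULL. */
static const struct usrreqs nobind_reqs = { .bind = NULL };
```

Protocols that do not support a request no longer need a stub that
only returns an error.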


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
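
Why the unsigned suffix matters can be shown with a small sketch. The
constant values and names below are assumptions chosen so that the
highest timer flag lands on bit 31; they are not copied from the header.

```c
/*
 * Hypothetical layout: per-timer flags are derived by shifting a base
 * flag left by the timer index. Without the U suffix the constant has
 * type int, and shifting it into bit 31 is undefined behaviour.
 */
#define TIMER_BASE_FLAG 0x04000000U     /* assumed value */
#define TIMER_LAST      5               /* assumed index of the last timer */

static unsigned int
timer_flag(unsigned int idx)
{
    /* unsigned shift: well defined even when the result is bit 31 */
    return TIMER_BASE_FLAG << idx;
}

static unsigned int
timer_arm(unsigned int flags, unsigned int idx)
{
    return flags | timer_flag(idx);
}

static int
timer_isarmed(unsigned int flags, unsigned int idx)
{
    return (flags & timer_flag(idx)) != 0;
}
```

With a signed constant the same shift by TIMER_LAST would overflow int,
which is what an undefined behaviour sanitizer reports.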


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok derradt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timer.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
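
The role switch between the active and the passive syn cache can be
sketched as follows. This is a simplified userland model under
assumptions, not the code in tcp_input.c: struct cache_set, struct
cache_pair and the pair_*() functions are hypothetical, the new seed is
passed in by the caller instead of coming from a random source, and only
the counters are modelled. The limit of 100000 uses is the value named in
the message above.

```c
#include <assert.h>
#include <stdint.h>

#define USE_LIMIT 100000	/* value taken from the commit message */

struct cache_set {
	int count;		/* entries currently stored */
	long use;		/* insertions left before a switch is wanted */
	uint32_t seed;		/* hash seed for this set */
};

struct cache_pair {
	struct cache_set set[2];
	int active;		/* index of the set that takes new entries */
};

static void
pair_init(struct cache_pair *p, uint32_t seed)
{
	p->set[0] = (struct cache_set){ 0, USE_LIMIT, seed };
	p->set[1] = (struct cache_set){ 0, USE_LIMIT, 0 };
	p->active = 0;
}

/* Insert one entry; new_seed stands in for fresh random data. */
static void
pair_insert(struct cache_pair *p, uint32_t new_seed)
{
	struct cache_set *act = &p->set[p->active];
	struct cache_set *pas = &p->set[!p->active];

	if (act->use <= 0 && pas->count == 0) {
		/* swap roles and reseed the set that becomes active */
		pas->seed = new_seed;
		pas->use = USE_LIMIT;
		p->active = !p->active;
		act = pas;
	}
	act->count++;
	act->use--;
}

/* Remove one entry from a given set, e.g. on timeout or completion. */
static void
pair_remove(struct cache_pair *p, int idx)
{
	assert(idx == 0 || idx == 1);
	if (p->set[idx].count > 0)
		p->set[idx].count--;
}
```

In this model an attacker who keeps entries alive in one set can delay
only the next switch: new entries go to the other set, and the old one
idles out as soon as its entries time out.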


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
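
For readers unfamiliar with the idiom, a minimal sketch follows. The
macro CLAMP_RANGE and the function clamp_if_positive() are hypothetical
and not taken from tcp_var.h. Wrapping a multi-statement macro in
do { } while (0) makes it expand to a single statement, so it can be
followed by a semicolon and used as the body of an if with an else
branch. The CONSTCOND comment tells lint that the constant condition is
intended.

```c
#include <assert.h>

/* Hypothetical macro, not taken from tcp_var.h. */
#define CLAMP_RANGE(v, lo, hi) do {			\
	assert((lo) <= (hi));				\
	if ((v) < (lo))					\
		(v) = (lo);				\
	else if ((v) > (hi))				\
		(v) = (hi);				\
} while (/* CONSTCOND */ 0)

static int
clamp_if_positive(int v, int lo, int hi)
{
	if (v > 0)
		CLAMP_RANGE(v, lo, hi);	/* expands to one statement */
	else
		v = 0;			/* this else pairs with "if (v > 0)" */
	return v;
}
```

With a plain { } block instead of the do/while wrapper, the semicolon
after the macro call would end the if statement and the else would be a
syntax error.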


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.172 16-Nov-2023 bluhm

Run TCP SYN cache timer logic without net lock.

Introduce global TCP SYN cache mutex. Divide timer function in
parts protected by mutex and sending with netlock. Split the flags
field in dynamic flags protected by mutex and fixed flags set during
initialization. Document whether fields of struct syn_cache are
protected by net lock or mutex.

input and OK sashan@


Revision tags: OPENBSD_7_4_BASE
# 1.171 04-Sep-2023 bluhm

Fix netstat output of uses of current SYN cache left.

TCP syn cache variable scs_use is basically counting packet insertions
into syn cache. Prefer type long to exclude overflow on fast
machines. Due to counting downwards from a limit, it can become
negative. Copy it out as tcps_sc_uses_left via sysctl, and print
it as signed long long integer.

OK mvs@


# 1.170 28-Aug-2023 bluhm

Introduce reference counting for TCP syn cache entries.

The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately
it has a race and panics sometimes with pool_do_get: syncache free
list modified. Add a reference counter for timeout and list of syn
cache entries. Currently list refcount is not strictly necessary
due to exclusive netlock, but will be needed when we continue
unlocking.

Checking timeout_initialized() is not MP friendly, better do proper
initialization during object allocation. Refcount in btrace helps
to find leaks.

bug reported and fix tested by Peter J. Philipp
OK claudio@


# 1.169 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
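
The numbers quoted above can be checked with a short calculation. The
helpers below are a sketch under assumptions: wrap_days(), ts_from_now()
and ts_elapsed() are hypothetical names, not functions from the TCP
stack. They show how long a millisecond counter of a given width lasts,
how the low 32 bits would be taken for the timestamp option, and why
differences of 32 bit timestamps stay correct across a wrap.

```c
#include <assert.h>
#include <stdint.h>

#define MS_PER_DAY	(1000ULL * 60 * 60 * 24)

/* Whole days until a millisecond counter of the given width wraps. */
static uint64_t
wrap_days(unsigned int bits)
{
	assert(bits >= 1 && bits <= 63);
	return (1ULL << bits) / MS_PER_DAY;
}

/* Value to place in the 32 bit TCP timestamp option: the low 32 bits. */
static uint32_t
ts_from_now(uint64_t now_ms)
{
	return (uint32_t)now_ms;
}

/* Elapsed milliseconds between two 32 bit timestamps, wrap tolerant. */
static uint32_t
ts_elapsed(uint32_t later, uint32_t earlier)
{
	return later - earlier;		/* unsigned arithmetic is modulo 2^32 */
}
```

wrap_days(32) gives 49 whole days, matching the 49 days mentioned above,
and wrap_days(63) divided by 365 is roughly 2.9*10^8 years.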


# 1.168 02-Jul-2023 bluhm

Use TSO and LRO on the loopback interface to transfer TCP faster.

If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.

tested by jan@; OK claudio@ jan@


# 1.167 23-May-2023 jan

New counters for LRO packets from hardware TCP offloading.

With tweaks from patrick@ and bluhm@.

OK bluhm@


# 1.166 18-May-2023 jan

Use TSO offloading in ix(4).

With a lot of tweaks, improvements and testing from bluhm.

Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.

ok bluhm


# 1.165 15-May-2023 bluhm

Implement the TCP/IP layer for hardware TCP segmentation offload.
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@


# 1.164 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.163 14-Mar-2023 yasuoka

To avoid misunderstanding, keep variables for tcp keepalive in
milliseconds, which is the same unit of tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.

ok claudio


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, the window size had been fixed at 16KB in such
cases, which limits throughput severely on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it isn't called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to void. We have no interest
in the pru_rcvd() return value.

Drop the "pru_rcvd != NULL" check within the pru_rcvd() wrapper. We only call it
if the socket's protocol has the PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets return an EINVAL error for
the PRU_SOCKADDR request, so keep this behaviour for a while instead of making
the pru_sockaddr handler optional and returning EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to the `so_q' or `so_q0' queues of a
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT is used to destroy them on
listening socket destruction.

Currently all our sockets support the PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with a separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on a PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had a `control' mbuf(9)
leak. It was fixed in the new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect)().

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support the request, leave the handler NULL. Do the
NULL check within the corresponding pru_() wrapper and return EOPNOTSUPP in
such a case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
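
The NULL-handler convention described in this commit can be pictured with a small stand-alone sketch. It is not the kernel's pr_usrreqs definition: struct usrreqs_sketch, bind_wrapper() and dummy_bind() are hypothetical names, and the argument types are reduced to opaque pointers for illustration.

```c
#include <errno.h>
#include <stddef.h>

struct socket;	/* opaque in this sketch */
struct mbuf;	/* opaque in this sketch */

/* Hypothetical, reduced handler table: one optional handler. */
struct usrreqs_sketch {
	int	(*bind)(struct socket *, struct mbuf *);
};

/* Wrapper: a NULL handler means the protocol does not support the request. */
int
bind_wrapper(const struct usrreqs_sketch *ur, struct socket *so,
    struct mbuf *nam)
{
	if (ur->bind == NULL)
		return EOPNOTSUPP;
	return (*ur->bind)(so, nam);
}

/* Example handler for a protocol that does support bind. */
int
dummy_bind(struct socket *so, struct mbuf *nam)
{
	(void)so;
	(void)nam;
	return 0;
}
```

Putting the check in the wrapper means protocols without the feature need no stub function at all, only an unset pointer.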


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot more specialized info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time, the more specific
data is only populated for privileged processes. This is done to avoid sharing
data with userland that might allow attacking a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
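
A small sketch of why the constants were made unsigned: if a signed int constant is shifted left so that a 1 lands in bit 31, the result is not representable in a 32-bit int and the behaviour is undefined in C. The macro name and values below are illustrative assumptions, chosen so that the largest shift reaches the top bit; they are not copied from tcp_var.h.

```c
/*
 * Illustrative values only (assumed for this sketch).
 * The U suffix makes the constant unsigned, so shifting it into
 * bit 31 is well defined.  With a plain 0x04000000 (signed int),
 * a shift by 5 would overflow a 32-bit int: undefined behaviour.
 */
#define FLAG_TIMER_BASE		0x04000000U

/* Return the flag bit for the timer with the given index (0..5). */
unsigned int
timer_flag(unsigned int index)
{
	return FLAG_TIMER_BASE << index;
}
```

Since the flags are stored in an unsigned field anyway, unsigned constants also avoid implicit sign conversions in every test of the form (flags & FLAG).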


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delayed ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
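
The per-timer flag idea can be sketched as below. This is a simplified user-space model under stated assumptions: struct timer_sketch and the timer_arm(), timer_cancel() and timer_handler() functions are hypothetical, and the point where the real handler would sleep for the netlock is only marked by a comment.

```c
/* Hypothetical model: one "armed" bit per timer plus a fire counter. */
struct timer_sketch {
	unsigned int	armed;		/* bit N set: timer N is armed */
	int		fired[4];	/* how often timer N did its work */
};

void
timer_arm(struct timer_sketch *t, int which)
{
	t->armed |= 1U << which;
}

void
timer_cancel(struct timer_sketch *t, int which)
{
	t->armed &= ~(1U << which);
}

/*
 * Timer handler.  In the kernel it may sleep while grabbing the lock,
 * so the timer can be canceled between expiry and execution.
 * Returns 1 if the timer work ran, 0 if it had been canceled.
 */
int
timer_handler(struct timer_sketch *t, int which)
{
	/* (lock would be acquired here) */
	if ((t->armed & (1U << which)) == 0)
		return 0;		/* canceled while we waited */
	t->armed &= ~(1U << which);
	t->fired[which]++;
	return 1;
}
```

The flag is the source of truth for cancelation; removing the pending timeout alone cannot stop a handler that has already started and is waiting for the lock.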


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as a soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
packets not SACKed up to the most forward SACK are lost. It may be a
correct estimate, if the network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows simplifying code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow adjusting tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent with the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
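
The active/passive scheme can be illustrated with a simplified model. Everything here is an assumption made for the sketch: the struct layouts, the use_limit field (the commit uses 100000), and sketch_random(), which is a deterministic stand-in for a real random source. The actual implementation lives in tcp_input.c and differs in detail.

```c
#include <stdint.h>

/* One syn cache set: entry count, use counter, hash seed. */
struct cache_set {
	int		count;	/* entries currently in this set */
	long		use;	/* insertions since last reseed */
	uint32_t	seed;	/* seed for the hash function */
};

struct syn_sets {
	struct cache_set	set[2];
	int			active;		/* index of the active set */
	long			use_limit;	/* 100000 in the commit */
};

/* Deterministic stand-in for a real random number source. */
static uint32_t next_seed = 1;

uint32_t
sketch_random(void)
{
	return next_seed++;
}

/* Insert a new entry; swap roles and reseed when allowed. */
void
sets_insert(struct syn_sets *s)
{
	struct cache_set *act = &s->set[s->active];
	struct cache_set *pas = &s->set[!s->active];

	/* Swap only if the passive set has idled out completely. */
	if (act->use >= s->use_limit && pas->count == 0) {
		s->active = !s->active;
		pas->seed = sketch_random();
		pas->use = 0;
		act = pas;
	}
	act->count++;
	act->use++;
}
```

New entries only ever go to the active set, so an attacker who keeps it populated cannot stop the passive set from draining, and the reseed eventually happens.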


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows writing relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similarly
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows running isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
an alternate routing table and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered, still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
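
For readers unfamiliar with the idiom, here is a small stand-alone example of a multi-statement macro wrapped in do { } while (0). The macro INT_SWAP and the function sort_pair() are made up for illustration and do not come from tcp_var.h.

```c
/*
 * Multi-statement macro wrapped in do { } while (0) so that it
 * behaves like a single statement and needs a trailing semicolon.
 * A bare { ... } block followed by ';' would break the if/else below.
 */
#define INT_SWAP(a, b) do {			\
	int tmp_ = (a);				\
	(a) = (b);				\
	(b) = tmp_;				\
} while (/* CONSTCOND */ 0)

/* Put two ints in ascending order; return 1 if they were swapped. */
int
sort_pair(int *lo, int *hi)
{
	if (*lo > *hi)
		INT_SWAP(*lo, *hi);
	else
		return 0;
	return 1;
}
```

The CONSTCOND comment is the conventional hint to lint that the constant loop condition is intentional.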


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
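
The reason sequence-aware minimum and maximum helpers are needed is that TCP sequence numbers wrap around at 2^32, so a plain numeric comparison gives the wrong order near the wrap. The sketch below shows the usual signed-difference technique with hypothetical function names (seq_lt, seq_min, seq_max); it is not a copy of the kernel macros.

```c
#include <stdint.h>

typedef uint32_t seq_t;

/*
 * a is "before" b if the 32-bit difference, read as a signed
 * number, is negative.  The cast relies on two's complement
 * conversion, which common compilers provide.
 */
int
seq_lt(seq_t a, seq_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Earlier of two sequence numbers, wraparound aware. */
seq_t
seq_min(seq_t a, seq_t b)
{
	return seq_lt(a, b) ? a : b;
}

/* Later of two sequence numbers, wraparound aware. */
seq_t
seq_max(seq_t a, seq_t b)
{
	return seq_lt(a, b) ? b : a;
}
```

The comparison is only meaningful when the two values are less than 2^31 apart, which holds for sequence numbers within one connection's window.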


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.171 04-Sep-2023 bluhm

Fix netstat output of uses of current SYN cache left.

TCP syn cache variable scs_use is basically counting packet insertions
into syn cache. Prefer type long to exclude overflow on fast
machines. Due to counting downwards from a limit, it can become
negative. Copy it out as tcps_sc_uses_left via sysctl, and print
it as signed long long integer.

OK mvs@


# 1.170 28-Aug-2023 bluhm

Introduce reference counting for TCP syn cache entries.

The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately
it has a race and panics sometimes with pool_do_get: syncache free
list modified. Add a reference counter for timeout and list of syn
cache entries. Currently list refcount is not strictly necessary
due to exclusive netlock, but will be needed when we continue
unlocking.

Checking timeout_initialized() is not MP friendly, better do proper
initialization during object allocation. Refcount in btrace helps
to find leaks.

bug reported and fix tested by Peter J. Philipp
OK claudio@


# 1.169 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
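
The arithmetic in this entry can be checked with a minimal sketch. It assumes a counter that ticks once per millisecond and a 365.25-day year. All `ex_` names are hypothetical and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stdint.h>

#define EX_MS_PER_DAY	(1000.0 * 60 * 60 * 24)
#define EX_MS_PER_YEAR	(EX_MS_PER_DAY * 365.25)	/* assumed year length */

/* 2^bits as a double; exact for every width up to 64. */
static double
ex_pow2(unsigned int bits)
{
	double v = 1.0;

	assert(bits <= 64);
	while (bits-- > 0)
		v *= 2.0;
	return v;
}

/* Days until a millisecond counter of the given width wraps around. */
static double
ex_wrap_days(unsigned int bits)
{
	return ex_pow2(bits) / EX_MS_PER_DAY;
}

/* Years until a millisecond counter of the given width wraps around. */
static double
ex_wrap_years(unsigned int bits)
{
	return ex_pow2(bits) / EX_MS_PER_YEAR;
}

/* The TCP timestamp option is 32 bits wide: keep the low 32 bits. */
static uint32_t
ex_ts_val(uint64_t now)
{
	return (uint32_t)now;
}

/*
 * Difference of two 32-bit timestamps using modular arithmetic, so the
 * result stays correct when the low 32 bits wrapped between b and a.
 */
static int32_t
ex_ts_diff(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b);
}
```

Under these assumptions `ex_wrap_days(32)` is a little under 50 days and `ex_wrap_years(63)` is roughly 2.9*10^8 years, which matches the figures quoted above.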


# 1.168 02-Jul-2023 bluhm

Use TSO and LRO on the loopback interface to transfer TCP faster.

If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.

tested by jan@; OK claudio@ jan@


# 1.167 23-May-2023 jan

New counters for LRO packets from hardware TCP offloading.

With tweaks from patrick@ and bluhm@.

OK bluhm@


# 1.166 18-May-2023 jan

Use TSO offloading in ix(4).

With a lot of tweaks, improvements and testing from bluhm.

Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.

ok bluhm


# 1.165 15-May-2023 bluhm

Implement the TCP/IP layer for hardware TCP segmentation offload.
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@


# 1.164 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.163 14-Mar-2023 yasuoka

To avoid misunderstanding, keep variables for tcp keepalive in
milliseconds, which is the same unit of tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.

ok claudio


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. The timestamp option is
disabled on some OSs (e.g. Windows) or dropped by some
firewalls/routers; in such cases the window size had been fixed at
16KB, which limits throughput to a very low rate on high-latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, as it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
on pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has the PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets returns EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of make
pru_sockaddr handler optional and return EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
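
A minimal sketch of the issue this entry describes. The constants below are illustrative assumptions (a base timer flag of 0x04000000 and a delayed-ACK timer index of 5, out of six timers), and every `EX_`/`ex_` name is hypothetical. With a signed base constant, the shift for the last timer would produce a value that does not fit in `int`, which is undefined behavior in C. With an unsigned constant the result is simply bit 31.

```c
#include <assert.h>

#define EX_TF_TIMER		0x04000000U	/* unsigned base flag (assumed value) */
#define EX_TCPT_DELACK		5		/* assumed index of the delayed ACK timer */
#define EX_TCPT_NTIMERS		6		/* assumed number of timers */

/* Flag bit that marks the given timer as armed. */
static unsigned int
ex_timer_flag(unsigned int timer)
{
	assert(timer < EX_TCPT_NTIMERS);
	/* Unsigned shift: reaching bit 31 is well defined. */
	return EX_TF_TIMER << timer;
}

/* Nonzero if the given timer is marked as armed in flags. */
static int
ex_timer_isarmed(unsigned int flags, unsigned int timer)
{
	return (flags & ex_timer_flag(timer)) != 0;
}
```

Storing the flags in an unsigned variable, as the entry notes for `u_int t_flags`, keeps the mask and test operations in unsigned arithmetic throughout.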


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok derradt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that require a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
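
The role switch described in this entry can be sketched as follows. This is a simplified model under stated assumptions, not the kernel code: the hash table itself is omitted, the new seed is supplied by the caller instead of a random source, and all `ex_` names are hypothetical. Only the limit of 100000 uses comes from the entry above.

```c
#include <assert.h>
#include <stdint.h>

#define EX_USE_LIMIT	100000	/* uses of the active set before a switch is due */

struct ex_syn_set {
	int		count;		/* entries currently stored */
	long		uses_left;	/* insertions left, counts downwards */
	uint32_t	seed;		/* hash seed of this set */
};

struct ex_syn_cache {
	struct ex_syn_set	set[2];
	int			active;	/* index of the set that gets new entries */
};

static void
ex_init(struct ex_syn_cache *sc, uint32_t seed)
{
	assert(sc != 0);
	sc->set[0] = (struct ex_syn_set){ 0, EX_USE_LIMIT, seed };
	sc->set[1] = (struct ex_syn_set){ 0, 0, 0 };
	sc->active = 0;
}

/*
 * Account for one new entry. If the passive set has idled out and the
 * active set is used up, reseed the passive set and switch roles first.
 * Returns 1 if the roles were switched, 0 otherwise.
 */
static int
ex_insert(struct ex_syn_cache *sc, uint32_t new_seed)
{
	struct ex_syn_set *act = &sc->set[sc->active];
	struct ex_syn_set *pas = &sc->set[!sc->active];
	int switched = 0;

	if (pas->count == 0 && act->uses_left <= 0) {
		pas->seed = new_seed;
		pas->uses_left = EX_USE_LIMIT;
		sc->active = !sc->active;
		act = pas;
		switched = 1;
	}
	act->count++;
	act->uses_left--;
	return switched;
}

/* An entry of the given set timed out or completed the handshake. */
static void
ex_remove(struct ex_syn_cache *sc, int idx)
{
	assert(sc->set[idx].count > 0);
	sc->set[idx].count--;
}
```

In this model an attacker who keeps the old set populated only delays the next switch. New entries already go into the freshly seeded set, and the old one drains as its entries expire.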


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
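
For readers unfamiliar with the idiom named in this entry, here is a minimal sketch. The macro and struct are hypothetical and not taken from the header. Wrapping a multi-statement macro in `do { } while (0)` makes it behave as a single statement, so it can be followed by a semicolon inside an unbraced `if`/`else`.

```c
#include <assert.h>
#include <stddef.h>

struct ex_pair {
	int	a;
	int	b;
};

/* Two statements that expand to exactly one statement. */
#define EX_RESET(p) do {						\
	(p)->a = 0;							\
	(p)->b = 0;							\
} while (/* CONSTCOND */ 0)

/* Reset the pair if cond is set, otherwise mark it; return a + b. */
static int
ex_reset_if(struct ex_pair *p, int cond)
{
	assert(p != NULL);
	if (cond)
		EX_RESET(p);	/* the trailing ';' ends the do-while */
	else
		p->a = 9;
	return p->a + p->b;
}
```

With a plain `{ ... }` block instead, the semicolon after the macro call would terminate the `if` statement and leave the `else` without a matching `if`, which is a compile error.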


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.170 28-Aug-2023 bluhm

Introduce reference counting for TCP syn cache entries.

The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately
it has a race and panics sometimes with pool_do_get: syncache free
list modified. Add a reference counter for timeout and list of syn
cache entries. Currently list refcount is not strictly necessary
due to exclusive netlock, but will be needed when we continue
unlocking.

Checking timeout_initialized() is not MP friendly, better do proper
initialization during object allocation. Refcount in btrace helps
to find leaks.

bug reported and fix tested by Peter J. Philipp
OK claudio@


# 1.169 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
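
The arithmetic in this entry can be checked with a small sketch. The
helper names below (ex_ts_from_now, ex_ts_elapsed, ex_wrap_days) are
hypothetical and are not taken from the kernel source; the sketch only
assumes a counter that advances once per millisecond.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the TCP timestamp option is 32 bits wide, so
 * only the lower 32 bits of the 64-bit millisecond counter are sent.
 */
static uint32_t
ex_ts_from_now(uint64_t now_ms)
{
	return (uint32_t)now_ms;
}

/*
 * Elapsed milliseconds between two 32-bit timestamps. Unsigned
 * subtraction is modulo 2^32, so the result stays correct when the
 * 32-bit value has wrapped once between the two samples.
 */
static uint32_t
ex_ts_elapsed(uint32_t later, uint32_t earlier)
{
	return (uint32_t)(later - earlier);
}

/*
 * Days until a counter of the given width wraps, assuming one tick
 * per millisecond.
 */
static double
ex_wrap_days(unsigned int bits)
{
	double ms = 1.0;
	unsigned int i;

	for (i = 0; i < bits; i++)
		ms *= 2.0;
	return ms / 1000.0 / 86400.0;
}
```

Under these assumptions ex_wrap_days(32) is a little under 50 days,
matching the 49 days quoted above, and ex_wrap_days(63) divided by
365.25 is roughly 2.9*10^8 years.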


# 1.168 02-Jul-2023 bluhm

Use TSO and LRO on the loopback interface to transfer TCP faster.

If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.

tested by jan@; OK claudio@ jan@


# 1.167 23-May-2023 jan

New counters for LRO packets from hardware TCP offloading.

With tweaks from patrick@ and bluhm@.

OK bluhm@


# 1.166 18-May-2023 jan

Use TSO offloading in ix(4).

With a lot of tweaks, improvements and testing from bluhm.

Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.

ok bluhm


# 1.165 15-May-2023 bluhm

Implement the TCP/IP layer for hardware TCP segmentation offload.
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@


# 1.164 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.163 14-Mar-2023 yasuoka

To avoid misunderstanding, keep variables for tcp keepalive in
milliseconds, which is the same unit of tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.

ok claudio


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, the window size was fixed at 16KB in such cases,
which limits throughput to very low rates on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest in pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
in pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets returns EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of make
pru_sockaddr handler optional and return EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
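
A minimal sketch of the problem this entry describes. The constants
EX_TF_TIMER and EX_TCPT_LAST are illustrative values chosen so that
the last shift reaches bit 31; they are assumptions, not the values
from tcp_var.h or tcp_timer.h.

```c
/*
 * Illustrative values only. With a plain signed constant
 * (0x04000000 without the U suffix), shifting left by 5 would move a
 * one into the sign bit of an int, which is undefined behavior in C.
 * The U suffix makes the shift well defined.
 */
#define EX_TF_TIMER	0x04000000U	/* hypothetical base timer flag */
#define EX_TCPT_LAST	5		/* hypothetical highest timer index */

/* Flag bit for the given timer index, computed in unsigned arithmetic. */
static unsigned int
ex_timer_flag(unsigned int timer)
{
	return EX_TF_TIMER << timer;
}
```

With these example values ex_timer_flag(EX_TCPT_LAST) is 0x80000000U,
which fits in an unsigned t_flags field without touching a sign bit.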


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
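
The active/passive scheme described above can be sketched roughly as
follows. This is a simplified model under stated assumptions, not the
code from tcp_input.c: the struct and function names are hypothetical,
the hash buckets are reduced to an entry counter, and the new seed is
passed in by the caller instead of being drawn from a random source.

```c
#include <stdint.h>

/* Hypothetical, simplified model of one syn cache set. */
struct ex_syn_set {
	int		count;	/* entries currently stored */
	int		used;	/* insertions since last reseed */
	uint32_t	seed;	/* hash seed for this set */
};

struct ex_syn_cache {
	struct ex_syn_set set[2];
	int		active;		/* index of the set taking new entries */
	int		use_limit;	/* insertions allowed before a switch */
};

/*
 * Insert one entry. If the active set has reached its use limit and
 * the passive set has idled out completely, the sets switch roles and
 * the new active set is reseeded. Returns 1 if a switch happened.
 */
static int
ex_syn_insert(struct ex_syn_cache *c, uint32_t fresh_seed)
{
	struct ex_syn_set *active = &c->set[c->active];
	struct ex_syn_set *passive = &c->set[1 - c->active];
	int switched = 0;

	if (active->used >= c->use_limit && passive->count == 0) {
		c->active = 1 - c->active;
		active = passive;	/* former passive set takes over */
		active->seed = fresh_seed;
		active->used = 0;
		switched = 1;
	}
	active->count++;
	active->used++;
	return switched;
}

/* Remove one entry from the given set, e.g. on timeout or completion. */
static void
ex_syn_remove(struct ex_syn_cache *c, int which)
{
	if (c->set[which].count > 0)
		c->set[which].count--;
}
```

In this model an attacker who keeps entries in the active set can no
longer block reseeding, because the reseed only waits for the passive
set, which receives no new entries, to drain.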


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Test by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
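
The do { } while (0) idiom mentioned in 1.90 can be illustrated with a
small sketch. The macro and function names below are hypothetical and do
not come from tcp_var.h; the point is only that a multi-statement macro
wrapped this way behaves as a single statement, so it stays safe in an
unbraced if/else.

```c
/* Hypothetical multi-statement macro, wrapped so it acts as one statement. */
#define EX_SWAP_INT(a, b) do {		\
	int ex_tmp_ = (a);		\
	(a) = (b);			\
	(b) = ex_tmp_;			\
} while (/* CONSTCOND */ 0)

/*
 * Put *lo and *hi in ascending order.  Returns 1 if they were swapped,
 * 0 if they were already ordered.  Without the do/while wrapper the
 * trailing ';' after the macro would detach the else branch.
 */
int
ex_order(int *lo, int *hi)
{
	if (*lo > *hi)
		EX_SWAP_INT(*lo, *hi);
	else
		return 0;
	return 1;
}
```

With a bare { } block instead of do { } while (0), the semicolon after
the macro call would end the if statement and the else would fail to
compile.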


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
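
The SEQ_MIN/SEQ_MAX helpers referenced in 1.71 compare TCP sequence
numbers modulo 2^32. The sketch below is modelled on the usual BSD
comparison macros but written as functions with hypothetical names, so
it should be read as an illustration rather than the kernel's code.

```c
#include <stdint.h>

/*
 * Wrap-aware "a is before b" for 32-bit sequence numbers.  The signed
 * cast relies on two's complement conversion, which holds on all common
 * platforms.
 */
int
ex_seq_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Earlier of two sequence numbers, taking wrap-around into account. */
uint32_t
ex_seq_min(uint32_t a, uint32_t b)
{
	return ex_seq_lt(a, b) ? a : b;
}

/* Later of two sequence numbers, taking wrap-around into account. */
uint32_t
ex_seq_max(uint32_t a, uint32_t b)
{
	return ex_seq_lt(a, b) ? b : a;
}
```

A plain numeric comparison would get the order wrong for a pair such as
0xfffffff0 and 0x10, where the smaller number is actually the later one
after the counter wraps.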


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
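
The fixed-point layout described in items 2 to 4 of 1.70 can be
sketched as follows. The shift values (3 for the smoothing gain, 2 for
the base shift, giving 5 fraction bits in srtt) are assumptions modelled
on classic BSD TCP, and all names are hypothetical.

```c
#include <assert.h>

/* Assumed shift values; hypothetical names, not the kernel's macros. */
#define EX_RTT_SHIFT		3	/* smoothing gain is 1/8 */
#define EX_RTT_BASE_SHIFT	2	/* fraction bits of a scaled sample */
#define EX_SRTT_FRAC_BITS	(EX_RTT_SHIFT + EX_RTT_BASE_SHIFT)	/* 5 */

/* Convert whole ticks to the fixed-point srtt representation. */
int
ex_srtt_from_ticks(int ticks)
{
	assert(ticks >= 0);
	return ticks << EX_SRTT_FRAC_BITS;
}

/* Fold one RTT sample (in ticks) into srtt: srtt += (rtt - srtt) / 8. */
int
ex_srtt_update(int srtt, int rtt_ticks)
{
	int delta;

	/* Both terms carry EX_RTT_BASE_SHIFT fraction bits at this point. */
	delta = (rtt_ticks << EX_RTT_BASE_SHIFT) - (srtt >> EX_RTT_SHIFT);
	return srtt + delta;
}
```

Starting from a smoothed RTT of 10 ticks, a sample of 18 ticks moves the
estimate to 11 ticks, which is the expected one-eighth step.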


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.
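
A port bitmask of the kind described in 1.8 can be sketched as one bit
per port number. The structure and function names below are
hypothetical and are not the kernel's definitions; they only show the
set and test operations such a mask needs.

```c
#include <stdint.h>
#include <string.h>

#define EX_NPORTS	65536	/* one bit for every 16-bit port number */
#define EX_MAPBITS	32	/* bits per array element */

/* Hypothetical bitmask of ports that must not be allocated dynamically. */
struct ex_portmap {
	uint32_t bits[EX_NPORTS / EX_MAPBITS];
};

/* Clear every bit, so that no port is excluded. */
void
ex_port_clear_all(struct ex_portmap *m)
{
	memset(m->bits, 0, sizeof(m->bits));
}

/* Mark a port as excluded from dynamic allocation. */
void
ex_port_set(struct ex_portmap *m, uint16_t port)
{
	m->bits[port / EX_MAPBITS] |= 1U << (port % EX_MAPBITS);
}

/* Return 1 if the port is excluded, 0 otherwise. */
int
ex_port_isset(const struct ex_portmap *m, uint16_t port)
{
	return (m->bits[port / EX_MAPBITS] >> (port % EX_MAPBITS)) & 1U;
}
```

The whole map occupies 8 KB, and both operations are constant time,
which is why a bitmask suits a lookup done on every port allocation.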


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.169 06-Jul-2023 bluhm

Convert tcp_now() time counter to 64 bit.

After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.

As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.

Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.

OK yasuoka@
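
The wrap-around figures quoted in 1.169 can be checked with a few lines
of arithmetic. The helpers below are hypothetical and only reproduce the
numbers from the log message; the year length of 365.25 days is an
assumption made for the calculation.

```c
#include <stdint.h>

/* Days until a 32-bit millisecond counter wraps around (2^32 ms). */
double
ex_wrap_days_32(void)
{
	return 4294967296.0 / 1000.0 / 86400.0;
}

/* Years covered by 2^63 milliseconds, using 365.25-day years. */
double
ex_wrap_years_63(void)
{
	return 9223372036854775808.0 / 1000.0 / 86400.0 / 365.25;
}

/*
 * The TCP timestamp option is 32 bits wide, so only the low 32 bits of
 * a 64-bit millisecond counter fit into it.
 */
uint32_t
ex_ts_from_now(uint64_t now)
{
	return (uint32_t)now;
}
```

The first helper gives roughly 49.7 days and the second roughly
2.9*10^8 years, matching the values in the commit message.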


# 1.168 02-Jul-2023 bluhm

Use TSO and LRO on the loopback interface to transfer TCP faster.

If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.

tested by jan@; OK claudio@ jan@


# 1.167 23-May-2023 jan

New counters for LRO packets from hardware TCP offloading.

With tweaks from patrick@ and bluhm@.

OK bluhm@


# 1.166 18-May-2023 jan

Use TSO offloading in ix(4).

With a lot of tweaks, improvements and testing from bluhm.

Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.

ok bluhm


# 1.165 15-May-2023 bluhm

Implement the TCP/IP layer for hardware TCP segmentation offload.
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@


# 1.164 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.163 14-Mar-2023 yasuoka

To avoid misunderstanding, keep variables for tcp keepalive in
milliseconds, which is the same unit of tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.

ok claudio


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, in such a case the window size had been fixed at
16KB, which limits throughput to a very low level on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
on pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol have PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets returns EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of make
pru_sockaddr handler optional and return EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
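
The sign-bit problem described in 1.137 can be sketched with assumed
constants. The values below (a base flag of 0x04000000 and a delayed-ACK
timer index of 5) are assumptions chosen so that the shift reaches bit
31; the names are hypothetical. With a signed int constant, shifting
into bit 31 is undefined behaviour in C, whereas the unsigned form is
well defined.

```c
#include <assert.h>

/* Assumed values for illustration; not copied from tcp_var.h. */
#define EX_TF_TIMER	0x04000000U	/* flag bit of the first timer */
#define EX_TCPT_DELACK	5		/* index of the delayed ACK timer */

/* Flag bit for a given timer index, computed in unsigned arithmetic. */
unsigned int
ex_timer_flag(unsigned int timer)
{
	assert(timer <= EX_TCPT_DELACK);
	return EX_TF_TIMER << timer;
}
```

For the highest index the result is 0x80000000, which is exactly the
bit a signed 32-bit int would use as its sign.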


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timer.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
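
The armed/canceled flag described above can be sketched roughly as follows. This is only an illustration of the idea, not the code from the commit: the struct, the constants and the function names are all hypothetical stand-ins.

```c
/* Hypothetical stand-ins, not the definitions from tcp_var.h. */
#define SKETCH_TIMER_REXMT	0
#define SKETCH_TIMER_KEEP	1

struct sketch_tcpcb {
	unsigned int	t_armed;	/* one bit per armed timer */
	int		t_fired;	/* how often real timer work ran */
};

static void
sketch_timer_arm(struct sketch_tcpcb *tp, int timer)
{
	tp->t_armed |= 1U << timer;
}

static void
sketch_timer_cancel(struct sketch_tcpcb *tp, int timer)
{
	tp->t_armed &= ~(1U << timer);
}

/*
 * Timeout handler body.  It may run late, after sleeping for a lock,
 * so it rechecks the flag and bails out if the timer was canceled.
 * Returns 1 if the timer work ran, 0 if the timeout was stale.
 */
static int
sketch_timer_run(struct sketch_tcpcb *tp, int timer)
{
	if ((tp->t_armed & (1U << timer)) == 0)
		return 0;
	tp->t_armed &= ~(1U << timer);
	tp->t_fired++;
	return 1;
}
```

The point of the flag is that removing the timeout from the queue cannot stop a handler that is already waiting for the lock, whereas a flag checked inside the handler can.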


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
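
A minimal sketch of the active/passive switch described above, under stated assumptions: every name here is hypothetical, the use counter counts down from the limit (one possible way to track "used 100000 times"), and the new seed is passed in by the caller rather than drawn from a random source.

```c
#define SKETCH_USE_LIMIT	100000	/* limit named in the commit message */

/* Hypothetical stand-in for one syn cache set. */
struct sketch_syn_set {
	int		count;	/* entries currently in this set */
	int		use;	/* insertions left before a switch is due */
	unsigned int	seed;	/* hash seed of this set */
};

static struct sketch_syn_set	sketch_sets[2];
static int			sketch_active;

static void
sketch_syn_init(unsigned int seed)
{
	sketch_sets[0].count = sketch_sets[1].count = 0;
	sketch_sets[0].use = sketch_sets[1].use = SKETCH_USE_LIMIT;
	sketch_sets[0].seed = sketch_sets[1].seed = seed;
	sketch_active = 0;
}

/*
 * Account for one new entry.  If the active set has used up its budget
 * and the passive set has idled out completely, swap the roles and give
 * the new active set a fresh seed.  Returns 1 if the roles were swapped.
 */
static int
sketch_syn_insert(unsigned int new_seed)
{
	struct sketch_syn_set *act = &sketch_sets[sketch_active];
	struct sketch_syn_set *pas = &sketch_sets[!sketch_active];
	int switched = 0;

	if (act->use <= 0 && pas->count == 0) {
		sketch_active = !sketch_active;
		act = &sketch_sets[sketch_active];
		act->seed = new_seed;
		act->use = SKETCH_USE_LIMIT;
		switched = 1;
	}
	act->count++;
	act->use--;
	return switched;
}
```

Because new entries only ever go into the active set, an attacker who keeps a few stale entries alive can delay the passive set from emptying only until those entries time out, which is what makes the reseed reachable again.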


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Test by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
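
For readers unfamiliar with the idiom, a small sketch follows. The macro and the helper function are hypothetical and are not taken from tcp_var.h. They only show why a multi-statement macro is wrapped this way.

```c
/*
 * Hypothetical multi-statement macro.  The do { } while (0) wrapper
 * turns the body into a single statement that still needs a trailing
 * semicolon, so it behaves like a function call inside if/else.
 */
#define SKETCH_SWAP_INT(a, b) do {				\
	int sketch_tmp_ = (a);					\
	(a) = (b);						\
	(b) = sketch_tmp_;					\
} while (/* CONSTCOND */ 0)

/* Put *lo and *hi in ascending order.  Returns 1 if they were swapped. */
static int
sketch_order_pair(int *lo, int *hi)
{
	if (*lo > *hi)
		SKETCH_SWAP_INT(*lo, *hi);
	else
		return 0;
	return 1;
}
```

With a plain { } block instead of the wrapper, the semicolon after the macro call would end the if statement early and the following else would fail to compile.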


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, theres no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun
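
For reference, RFC 3390 bounds the initial window at min(4*MSS, max(2*MSS, 4380 bytes)). A minimal sketch of that formula, with a hypothetical function name that is not the kernel's own:

```c
/*
 * Upper bound on the initial congestion window per RFC 3390:
 * min(4 * MSS, max(2 * MSS, 4380 bytes)).
 * Hypothetical helper, for illustration only.
 */
static unsigned int
sketch_initial_window(unsigned int mss)
{
	unsigned int lower = 2 * mss > 4380 ? 2 * mss : 4380;
	unsigned int upper = 4 * mss;

	return upper < lower ? upper : lower;
}
```

For a typical Ethernet MSS of 1460 bytes this gives 4380 bytes, i.e. three full segments.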


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoid traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.168 02-Jul-2023 bluhm

Use TSO and LRO on the loopback interface to transfer TCP faster.

If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.

tested by jan@; OK claudio@ jan@


# 1.167 23-May-2023 jan

New counters for LRO packets from hardware TCP offloading.

With tweaks from patrick@ and bluhm@.

OK bluhm@


# 1.166 18-May-2023 jan

Use TSO offloading in ix(4).

With a lot of tweaks, improvements and testing from bluhm.

Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.

ok bluhm


# 1.165 15-May-2023 bluhm

Implement the TCP/IP layer for hardware TCP segmentation offload.
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@


# 1.164 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.163 14-Mar-2023 yasuoka

To avoid misunderstanding, keep variables for tcp keepalive in
milliseconds, which is the same unit of tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.

ok claudio


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, in such a case the window size had been fixed at
16KB, which limits throughput to very low values on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
on pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol have PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets return EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of making
the pru_sockaddr handler optional and returning EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT is used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, the unneeded handlers will be removed with a separate diff, and
this time PRU_ABORT requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support the request, leave the handler NULL.
Do the NULL check within the corresponding pru_() wrapper and return
EOPNOTSUPP in such case. This will be done for all upcoming user request
handlers.

ok bluhm@ guenther@
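
The NULL-handler convention described in revision 1.142 can be sketched
roughly as follows. This is an illustration only, not the kernel code:
struct ex_usrreqs, ex_pru_bind() and ex_tcp_bind() are hypothetical names,
and the real handlers take socket and mbuf arguments that are omitted here.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified handler table; the real one has more members. */
struct ex_usrreqs {
	int (*pru_bind)(int port);	/* NULL if the protocol cannot bind */
};

/* Wrapper: do the NULL check once, so protocols need no dummy handlers. */
static int
ex_pru_bind(const struct ex_usrreqs *ur, int port)
{
	if (ur->pru_bind == NULL)
		return (EOPNOTSUPP);
	return (ur->pru_bind(port));
}

/* Example handler for a protocol that does support the request. */
static int
ex_tcp_bind(int port)
{
	return (port > 0 ? 0 : EINVAL);
}
```

A protocol without bind support simply leaves the member NULL and the
wrapper reports EOPNOTSUPP on its behalf.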


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
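
The sign-bit problem fixed in revision 1.137 can be shown with a small
sketch. The constants below are illustrative assumptions chosen so that the
shift reaches bit 31; they are not copied from tcp_var.h, and
ex_timer_flag() is a hypothetical helper.

```c
#include <stdint.h>

/* Illustrative values only; not copied from tcp_var.h. */
#define EX_TF_TIMER	0x04000000U	/* flag bit of the first timer */
#define EX_TCPT_DELACK	5		/* index of the last timer */

/*
 * With the U suffix the shift is done in unsigned arithmetic, so setting
 * bit 31 is well defined. Without it the constant is a signed int and
 * shifting a one into the sign bit is undefined behaviour in C.
 */
static uint32_t
ex_timer_flag(unsigned int timer)
{
	return (EX_TF_TIMER << timer);
}
```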


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delayed ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
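
The per-timer flag from revision 1.130 can be sketched as follows. This is
a simplified model under stated assumptions, not the kernel implementation:
struct ex_tcpcb and the ex_timer_*() functions are hypothetical, and the
locking is only indicated by a comment.

```c
/* Hypothetical, reduced control block: one flag bit per timer. */
struct ex_tcpcb {
	unsigned int t_flags;
	int fired;		/* counts handler runs that did real work */
};

#define EX_TF_TIMER	0x1U

static void
ex_timer_arm(struct ex_tcpcb *tp, unsigned int timer)
{
	tp->t_flags |= EX_TF_TIMER << timer;
}

static void
ex_timer_disarm(struct ex_tcpcb *tp, unsigned int timer)
{
	tp->t_flags &= ~(EX_TF_TIMER << timer);
}

/* Returns 1 if the timer was still armed and its work was done. */
static int
ex_timer_handler(struct ex_tcpcb *tp, unsigned int timer)
{
	/* The real handler may sleep for the net lock before this point. */
	if ((tp->t_flags & (EX_TF_TIMER << timer)) == 0)
		return (0);	/* canceled while we were waiting */
	tp->t_flags &= ~(EX_TF_TIMER << timer);
	tp->fired++;
	return (1);
}
```

Because the handler rechecks the flag after it has the lock, a timer that
was disarmed while the handler was waiting does nothing.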


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
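
The active/passive scheme from revision 1.111 can be sketched as below.
This is a minimal model, not the code in tcp_input.c: the ex_ names are
hypothetical, the seed is passed in by the caller instead of coming from a
random source, and hashing and entry timeouts are left out.

```c
#include <stdbool.h>

#define EX_USE_LIMIT	100000	/* uses before the active set is retired */

struct ex_syn_set {
	int count;		/* entries currently in this set */
	int use;		/* remaining uses before a switch is wanted */
	unsigned int seed;	/* hash seed, stands in for real random data */
};

struct ex_syn_cache {
	struct ex_syn_set set[2];
	int active;		/* index of the set receiving new entries */
};

/* Insert one entry; returns true if the sets switched roles first. */
static bool
ex_syn_cache_insert(struct ex_syn_cache *sc, unsigned int new_seed)
{
	struct ex_syn_set *act = &sc->set[sc->active];
	struct ex_syn_set *pas = &sc->set[!sc->active];
	bool switched = false;

	/* Switch only when the passive set has fully drained. */
	if (act->use <= 0 && pas->count == 0) {
		pas->seed = new_seed;
		pas->use = EX_USE_LIMIT;
		sc->active = !sc->active;
		act = pas;
		switched = true;
	}
	act->count++;
	act->use--;
	return (switched);
}
```

An attacker who keeps the old set non-empty only delays the switch until
those entries time out; new entries already go to the freshly seeded set
after the roles change.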


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; still missing are pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
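
The idiom named in revision 1.90 can be illustrated with a short sketch.
EX_SWAP() and ex_order() are hypothetical and do not appear in tcp_var.h;
they only show why a multi-statement macro is wrapped this way.

```c
/* Hypothetical multi-statement macro, wrapped so it acts as one statement. */
#define EX_SWAP(a, b) do {	\
	int ex_tmp = (a);	\
	(a) = (b);		\
	(b) = ex_tmp;		\
} while (/* CONSTCOND */ 0)

/* Returns 1 if the two values had to be swapped into order, else 0. */
static int
ex_order(int *lo, int *hi)
{
	if (*lo > *hi)
		EX_SWAP(*lo, *hi);	/* safe without braces */
	else
		return (0);
	return (1);
}
```

With a bare { } block instead, the semicolon after the macro call would end
the if statement and the following else would not compile.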


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.
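
For illustration only, a minimal sketch of the bitmask idea described
above. The names ex_port_set and ex_port_isset and the 1024-port range are
assumptions made for this example, not the identifiers or limits used in
the kernel.

```c
#include <stdint.h>

/* Assumption: track ports 0..1023, one bit per port. */
#define EX_NPORTS	1024
#define EX_NWORDS	(EX_NPORTS / 32)

/* Mark a port as "do not allocate dynamically". */
static void
ex_port_set(uint32_t *mask, unsigned int port)
{
	mask[port / 32] |= 1U << (port % 32);
}

/* Return 1 if the port is marked in the mask, 0 otherwise. */
static int
ex_port_isset(const uint32_t *mask, unsigned int port)
{
	return (mask[port / 32] >> (port % 32)) & 1U;
}
```

A lookup is a single shift and mask, which is why a bitmask suits a
check made on every dynamic port allocation.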


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.167 23-May-2023 jan

New counters for LRO packets from hardware TCP offloading.

With tweaks from patrick@ and bluhm@.

OK bluhm@


# 1.166 18-May-2023 jan

Use TSO offloading in ix(4).

With a lot of tweaks, improvements and testing from bluhm.

Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.

ok bluhm


# 1.165 15-May-2023 bluhm

Implement the TCP/IP layer for hardware TCP segmentation offload.
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@


# 1.164 10-May-2023 bluhm

Implement TCP send offloading, for now in software only. This is
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@


Revision tags: OPENBSD_7_3_BASE
# 1.163 14-Mar-2023 yasuoka

To avoid misunderstanding, keep variables for tcp keepalive in
milliseconds, which is the same unit of tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.

ok claudio


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, in such a case the window size had been fixed at
16KB, which severely limits throughput on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest in the pru_abort() return value. We call it only from
soabort(), which is a dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement the (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove the existing
code for all others; it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
in the pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has the PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets return EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of making
the pru_sockaddr handler optional and returning EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
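
For illustration only, a minimal sketch of why the unsigned suffix
matters. The constants below are assumptions chosen so that the shift
lands on bit 31; they are not copied from tcp_var.h. With a signed int
constant, the same shift would overflow into the sign bit, which is
undefined behaviour in C.

```c
/* Illustrative values, not taken from the real header. */
#define EX_TF_TIMER	0x04000000U	/* first timer flag, unsigned */
#define EX_TCPT_DELACK	5		/* timer index used as shift count */

/* Return the flag bit that belongs to the timer with the given index. */
static unsigned int
ex_timer_flag(int timer)
{
	/* Unsigned operand: shifting into bit 31 is well defined. */
	return EX_TF_TIMER << timer;
}
```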


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok derradt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
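
For illustration only, a minimal sketch of the role switch described
above. The struct layout, the names, and the way usage is counted are
assumptions made for this example; the kernel implementation may differ
in all of them.

```c
#include <stdint.h>

#define EX_USE_LIMIT	100000	/* uses of the active set before a switch */

/* Hypothetical per-set state: entry count, use counter, hash seed. */
struct ex_syn_set {
	int		count;
	long		use;
	uint32_t	seed;
};

/* Two sets; `active` is the index of the one that gets new entries. */
struct ex_syn_cache {
	struct ex_syn_set	set[2];
	int			active;
};

/*
 * Switch roles if the passive set has idled out and the active set has
 * been used often enough.  The empty set gets the new seed before it
 * becomes active.  Returns 1 if a switch happened, 0 otherwise.
 */
static int
ex_maybe_switch(struct ex_syn_cache *sc, uint32_t new_seed)
{
	struct ex_syn_set *act = &sc->set[sc->active];
	struct ex_syn_set *pas = &sc->set[!sc->active];

	if (pas->count != 0 || act->use < EX_USE_LIMIT)
		return 0;
	pas->seed = new_seed;
	pas->use = 0;
	sc->active = !sc->active;
	return 1;
}
```

Because only an empty set is ever reseeded, no existing entry has to be
rehashed, and an attacker who keeps one set populated cannot stop the
other from draining.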


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
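
For illustration only, a minimal sketch of the idiom named above. The
macro EX_SWAP_INT and the function ex_order are hypothetical and exist
only for this example. Wrapping a multi-statement macro in do { } while (0)
makes it behave as a single statement, so it can be followed by a
semicolon inside an if/else without breaking the else branch.

```c
/* Hypothetical multi-statement macro, wrapped so it acts as one statement. */
#define EX_SWAP_INT(a, b) do {					\
	int ex_tmp = (a);					\
	(a) = (b);						\
	(b) = ex_tmp;						\
} while (/* CONSTCOND */ 0)

/* Put *lo and *hi in ascending order; return 1 if they were swapped. */
static int
ex_order(int *lo, int *hi)
{
	if (*lo > *hi)
		EX_SWAP_INT(*lo, *hi);	/* a bare { } block here would orphan the else */
	else
		return 0;
	return 1;
}
```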


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun
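
For illustration only, a minimal sketch of the initial window upper
bound given in RFC 3390, min(4*MSS, max(2*MSS, 4380 bytes)). The function
name ex_initial_window is an assumption made for this example and is not
the name used in the kernel.

```c
/* Larger of two unsigned values. */
static unsigned int
ex_max(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* Smaller of two unsigned values. */
static unsigned int
ex_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Upper bound of the initial window in bytes for a given MSS. */
static unsigned int
ex_initial_window(unsigned int mss)
{
	return ex_min(4 * mss, ex_max(2 * mss, 4380));
}
```

For a typical Ethernet MSS of 1460 bytes this gives 4380 bytes, that is,
three full segments.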


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.
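
A per-second packet budget of this kind can be sketched as below. This is an
illustrative sketch only, not the kernel's code: the names ex_ratelimit and
ex_rate_ok are hypothetical, and time is passed in as whole seconds to keep
the example self-contained.

```c
/* Hypothetical per-second rate limiter state. */
struct ex_ratelimit {
	long	last_sec;	/* second in which the current window started */
	int	count;		/* packets allowed so far in that second */
};

/*
 * Return 1 if another packet may be sent at time now_sec, 0 if the
 * per-second budget is used up.  A negative limit disables the check.
 */
static int
ex_rate_ok(struct ex_ratelimit *rl, long now_sec, int limit)
{
	if (limit < 0)
		return (1);
	if (now_sec != rl->last_sec) {
		/* a new second has begun: start a fresh window */
		rl->last_sec = now_sec;
		rl->count = 0;
	}
	if (rl->count >= limit)
		return (0);
	rl->count++;
	return (1);
}
```

A caller would consult ex_rate_ok() before emitting each RST and drop the
reply when it returns 0.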


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.
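
A port bitmask of the kind described here can be sketched as follows. The
names (ex_badports, ex_port_set, ex_port_isset) and the 1024-port limit are
assumptions made for illustration; they are not taken from the header.

```c
#include <stdint.h>

#define EX_PORT_LIMIT	1024			/* assumed: low ports only */
#define EX_MAP_WORDS	(EX_PORT_LIMIT / 32)

/* Hypothetical bitmask: one bit per port that must not be allocated. */
struct ex_badports {
	uint32_t map[EX_MAP_WORDS];
};

/* Mark a port as excluded; ports outside the map are ignored. */
static void
ex_port_set(struct ex_badports *bp, unsigned int port)
{
	if (port < EX_PORT_LIMIT)
		bp->map[port / 32] |= 1U << (port % 32);
}

/* Return 1 if the port must not be handed out dynamically. */
static int
ex_port_isset(const struct ex_badports *bp, unsigned int port)
{
	if (port >= EX_PORT_LIMIT)
		return (0);
	return ((bp->map[port / 32] >> (port % 32)) & 1U);
}
```

Compared with a fixed list, a bitmask makes the lookup a constant-time bit
test and lets the whole set be replaced in a single sysctl write.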


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.163 14-Mar-2023 yasuoka

To avoid misunderstanding, keep variables for tcp keepalive in
milliseconds, which is the same unit as tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.

ok claudio
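
The split between an internal millisecond unit and a user-visible unit of
seconds can be sketched as a pair of conversion helpers. The function names
are hypothetical and the overflow check is an assumption about what a careful
handler would want, not a description of tcp_sysctl().

```c
#include <limits.h>

/*
 * Convert a value supplied in seconds to the internal millisecond unit.
 * Returns -1 if the value is negative or the multiplication would
 * overflow an int.
 */
static int
ex_sec_to_msec(int sec)
{
	if (sec < 0 || sec > INT_MAX / 1000)
		return (-1);
	return (sec * 1000);
}

/* Convert the internal millisecond value back to seconds for reporting. */
static int
ex_msec_to_sec(int msec)
{
	return (msec / 1000);
}
```

Keeping the conversion at the sysctl boundary means the rest of the code
compares like with like when it subtracts timestamps.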


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (e.g. Windows) or dropped by some
firewalls/routers, the window size in such cases had been fixed at
16KB, which limits throughput severely on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest in the pru_abort() return value. We call it only from
soabort() which is a dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
in the pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has the PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets return an EINVAL error for the
PRU_SOCKADDR request, so keep this behaviour for a while instead of making
the pru_sockaddr handler optional and returning EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support the PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with a separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
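
The NULL-handler convention described in this commit can be sketched as
follows. The structure and function names are hypothetical stand-ins with an
ex_ prefix and a simplified argument list; they do not reproduce the kernel's
pr_usrreqs layout.

```c
#include <errno.h>
#include <stddef.h>

struct ex_socket;	/* opaque for this sketch */

/* Hypothetical per-protocol handler table with a single entry. */
struct ex_usrreqs {
	int (*pru_bind)(struct ex_socket *, const void *addr);
};

/* Wrapper: protocols that leave the handler NULL do not support bind. */
static int
ex_pru_bind(const struct ex_usrreqs *ur, struct ex_socket *so,
    const void *addr)
{
	if (ur->pru_bind == NULL)
		return (EOPNOTSUPP);
	return ((*ur->pru_bind)(so, addr));
}

/* A trivial handler used to exercise the wrapper. */
static int
ex_dummy_bind(struct ex_socket *so, const void *addr)
{
	(void)so;
	(void)addr;
	return (0);
}
```

Putting the check in the wrapper keeps each protocol's table free of stub
functions that would only return an error.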


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot more specialized info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
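
The sign-bit hazard can be shown in isolation. The constant values below are
assumptions chosen to match the description (a base timer flag shifted by a
delayed ACK timer index of 5); they are not copied from the header.

```c
#include <stdint.h>

/* Assumed values, for illustration only. */
#define EX_TF_TIMER	0x04000000U	/* base timer flag, unsigned */
#define EX_TCPT_DELACK	5		/* index of the delayed ACK timer */

/*
 * Return the flag bit for a timer index.  Because EX_TF_TIMER is
 * unsigned, shifting into bit 31 is well defined.  With a signed
 * constant (0x04000000) the same shift would overflow int, which is
 * undefined behaviour in C.
 */
static uint32_t
ex_timer_flag(unsigned int timer)
{
	return (EX_TF_TIMER << timer);
}
```

With these values, ex_timer_flag(EX_TCPT_DELACK) yields 0x80000000, which
fits in an unsigned 32-bit field but not in a positive signed int.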


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
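
The armed-flag idea can be sketched as below: cancelation clears a bit, and
the handler checks that bit once it holds the lock. All names are hypothetical
and the scheduling itself is left out; this illustrates the technique rather
than the kernel's implementation.

```c
#include <stdint.h>

#define EX_TIMER_ARMED(i)	(1U << (i))	/* one bit per timer index */

struct ex_timers {
	uint32_t armed;		/* bit set while the timer is armed */
};

static void
ex_timer_arm(struct ex_timers *t, unsigned int i)
{
	t->armed |= EX_TIMER_ARMED(i);
	/* a real kernel would also schedule the timeout here */
}

static void
ex_timer_cancel(struct ex_timers *t, unsigned int i)
{
	t->armed &= ~EX_TIMER_ARMED(i);
	/* deleting the timeout alone may be too late if it already fired */
}

/*
 * Called from the timeout handler after it has acquired the lock.
 * Returns 1 if the timer work should run, 0 if the timer was canceled
 * while the handler was waiting.
 */
static int
ex_timer_should_run(struct ex_timers *t, unsigned int i)
{
	if ((t->armed & EX_TIMER_ARMED(i)) == 0)
		return (0);
	t->armed &= ~EX_TIMER_ARMED(i);	/* one-shot: disarm before running */
	return (1);
}
```

The flag is only meaningful if arm, cancel, and the check are all made under
the same lock, which this sketch assumes the caller holds.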


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
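
The role switch between an active and a passive set can be sketched as
follows. The structure layout and names are hypothetical, and new_seed stands
in for a value drawn from a random number source; this is a simplified model
of the scheme described above, not the syn cache code.

```c
#include <stdint.h>

struct ex_syn_set {
	int	 count;		/* entries currently in this set */
	int	 use;		/* insertions left before a switch is wanted */
	uint32_t seed;		/* hash seed for this set */
};

struct ex_syn_cache {
	struct ex_syn_set set[2];
	int		  active;	/* index of the set taking new entries */
};

/*
 * Account for a new entry.  If the active set has used up its budget and
 * the passive set has drained, swap roles and give the new active set a
 * fresh seed and budget.
 */
static void
ex_syn_insert(struct ex_syn_cache *sc, int use_limit, uint32_t new_seed)
{
	struct ex_syn_set *act = &sc->set[sc->active];
	struct ex_syn_set *pas = &sc->set[!sc->active];

	if (act->use <= 0 && pas->count == 0) {
		pas->seed = new_seed;
		pas->use = use_limit;
		sc->active = !sc->active;
		act = pas;
	}
	act->use--;
	act->count++;
}
```

Because the passive set receives no new entries, it empties as its entries
time out, so periodic SYN packets can no longer postpone the reseed forever.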


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it is no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
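
The fixed-point representation mentioned in item 4 can be sketched as
follows. This is an illustration under stated assumptions, not the
kernel's tcp_xmit_timer(): the DEMO_ names are hypothetical, and the
smoothing gain of 1/8 is the classic value from RFC 6298, assumed
here rather than taken from this commit.

```c
#define DEMO_SRTT_FRAC_BITS	5	/* per the log: t_srtt has 5 fraction bits */

/* Convert a whole number of ticks to the fixed-point form. */
int
demo_srtt_from_ticks(int ticks)
{
	return ticks << DEMO_SRTT_FRAC_BITS;
}

/* Drop the fraction bits to get back whole ticks. */
int
demo_srtt_to_ticks(int srtt)
{
	return srtt >> DEMO_SRTT_FRAC_BITS;
}

/*
 * Exponentially weighted moving average with an assumed gain of 1/8:
 * srtt += (sample - srtt) / 8, with the sample scaled to fixed point
 * first so that both operands use the same number of fraction bits.
 */
int
demo_srtt_update(int srtt, int sample_ticks)
{
	int delta = (sample_ticks << DEMO_SRTT_FRAC_BITS) - srtt;

	return srtt + delta / 8;
}
```

Forgetting the shift on either side of the subtraction mixes scaled
and unscaled values, which is the kind of error item 3 refers to.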


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.
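
A per-second rate limit of this kind can be sketched as below. This
is a simplified illustration, not the kernel's implementation: the
demo_ names are hypothetical, the caller supplies the current time in
whole seconds, and the sketch permits up to `limit` events per second.

```c
/* Hypothetical limiter state; zero-initialize before first use. */
struct demo_ratelimit {
	long	rl_sec;		/* second in which rl_count was last reset */
	int	rl_count;	/* events allowed so far in that second */
};

/*
 * Return 1 if another event may be sent during second now_sec,
 * 0 if the per-second limit has already been reached.
 */
int
demo_rst_allowed(struct demo_ratelimit *rl, long now_sec, int limit)
{
	if (now_sec != rl->rl_sec) {
		/* A new second has started: reset the counter. */
		rl->rl_sec = now_sec;
		rl->rl_count = 0;
	}
	if (rl->rl_count >= limit)
		return 0;
	rl->rl_count++;
	return 1;
}
```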


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.
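
A port bitmask of the kind described can be sketched like this. The
demo_ names are hypothetical, and the covered range of 1024 ports is
an assumption made for the example, not a value stated in the log.

```c
#include <stdint.h>

#define DEMO_NPORTS	1024	/* assumed range covered by the mask */

typedef uint32_t demo_portmask[DEMO_NPORTS / 32];

/* Mark a port as "do not allocate dynamically". */
void
demo_port_set(uint32_t *mask, unsigned int port)
{
	if (port < DEMO_NPORTS)
		mask[port / 32] |= 1U << (port % 32);
}

/* Return 1 if the port is marked, 0 otherwise (or if out of range). */
int
demo_port_isset(const uint32_t *mask, unsigned int port)
{
	if (port >= DEMO_NPORTS)
		return 0;
	return (mask[port / 32] >> (port % 32)) & 1U;
}
```

One bit per port keeps the whole table at 128 bytes for this range,
which makes it cheap to copy in and out through a sysctl.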


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.162 13-Dec-2022 claudio

In tcp_now() switch from getnsecuptime() to getnsecruntime()

The tcp timer is not supposed to run during suspend but getnsecuptime() does
and because of this sessions with TCP_KEEPALIVE on reset after a few hours
of sleep.

Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. The timestamp option is
disabled on some OSs (e.g. Windows) or dropped by some
firewalls/routers; in such cases the window size was fixed at
16KB, which severely limits throughput on high-latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest in the pru_abort() return value. We call it only from
soabort(), which is a dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement the (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove the existing
code for all others, as it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
on pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has the PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets return an EINVAL error for
the PRU_SOCKADDR request, so keep this behaviour for a while instead of
making the pru_sockaddr handler optional and returning EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support the PRU_ABORT request, but actually it
is required only for tcp(4) and unix(4) sockets, so it should be optional.
However, the others will be removed with a separate diff, and this time
PRU_ABORT requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
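
The wrapper pattern described above can be sketched as follows. The
demo_ types and functions are hypothetical stand-ins for the kernel's
socket and pr_usrreqs structures; only the NULL check and the
EOPNOTSUPP return follow the commit message.

```c
#include <errno.h>
#include <stddef.h>

struct demo_socket;	/* opaque for the purpose of this sketch */

/* Hypothetical table of per-request handlers. */
struct demo_usrreqs {
	int	(*pru_bind)(struct demo_socket *, const void *);
};

/* Example handler for a protocol that supports bind. */
int
demo_bind_ok(struct demo_socket *so, const void *addr)
{
	(void)so;
	(void)addr;
	return 0;
}

/*
 * Wrapper: protocols that do not support the request leave the
 * handler NULL, and the wrapper reports EOPNOTSUPP on their behalf.
 */
int
demo_pru_bind(const struct demo_usrreqs *ur, struct demo_socket *so,
    const void *addr)
{
	if (ur->pru_bind == NULL)
		return EOPNOTSUPP;
	return (*ur->pru_bind)(so, addr);
}
```

Keeping the check in the wrapper means protocols without bind support
need no stub function of their own.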


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
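
The problem can be illustrated with a short sketch. The DEMO_
constants are assumptions chosen so that the highest timer lands on
bit 31; they are not copied from tcp_var.h. With a plain int
constant, the same shift would overflow a 32-bit signed int, which is
undefined behaviour in C.

```c
#include <stdint.h>

#define DEMO_TF_TIMER	0x04000000U	/* assumed flag bit for timer 0 */
#define DEMO_TCPT_LAST	5		/* assumed index of the highest timer */

/*
 * Compute the flag bit for a timer. Because DEMO_TF_TIMER has an
 * unsigned type, shifting it into bit 31 is well defined.
 */
uint32_t
demo_timer_flag(unsigned int timer)
{
	return DEMO_TF_TIMER << timer;
}
```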


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
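
The armed-flag idea can be sketched as below. This is a
single-threaded illustration with hypothetical demo_ names; the
locking that the real timers rely on is left out.

```c
#include <stdbool.h>

enum { DEMO_NTIMERS = 4 };

/* Hypothetical control block: one "armed" bit per timer. */
struct demo_tcpcb {
	unsigned int	t_armed;
	int		t_fired[DEMO_NTIMERS];
};

void
demo_timer_arm(struct demo_tcpcb *tp, int i)
{
	tp->t_armed |= 1U << i;
}

void
demo_timer_cancel(struct demo_tcpcb *tp, int i)
{
	tp->t_armed &= ~(1U << i);
}

/*
 * Body of the timeout handler. It may run after the timer was
 * canceled (for example because it slept waiting for a lock), so it
 * checks the flag first and does nothing if the timer is not armed.
 */
bool
demo_timer_run(struct demo_tcpcb *tp, int i)
{
	if ((tp->t_armed & (1U << i)) == 0)
		return false;
	tp->t_armed &= ~(1U << i);
	tp->t_fired[i]++;
	return true;
}
```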


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
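
The role switch can be sketched as follows. All demo_ names are
hypothetical and the caller supplies the new seed; in a real kernel
the seed would come from a random source. Only the switching
condition (passive set empty and active set used up) follows the
description above.

```c
#include <stdint.h>

#define DEMO_USE_LIMIT	100000	/* uses of a set before a reseed is due */

struct demo_syn_set {
	int		count;	/* entries currently stored in this set */
	int		use;	/* insertions left before reseeding is due */
	uint32_t	seed;	/* hash seed for this set */
};

struct demo_syn_cache {
	struct demo_syn_set	set[2];
	int			active;	/* index of the set taking new entries */
};

/*
 * Pick the set that receives a new entry. If the active set has been
 * used up and the passive set has drained, reseed the passive set
 * and let the two swap roles.
 */
struct demo_syn_set *
demo_syn_select(struct demo_syn_cache *sc, uint32_t new_seed)
{
	struct demo_syn_set *act = &sc->set[sc->active];
	struct demo_syn_set *pas = &sc->set[!sc->active];

	if (act->use <= 0 && pas->count == 0) {
		pas->seed = new_seed;
		pas->use = DEMO_USE_LIMIT;
		sc->active = !sc->active;
		act = pas;
	}
	act->use--;
	act->count++;
	return act;
}
```

Because the old active set receives no new entries after the switch,
an attacker can no longer keep it from draining by sending more SYNs.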


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
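
A minimal sketch of why the do { } while (0) form matters for
multi-statement macros. The macro and function names below are
hypothetical and do not come from tcp_var.h; with a bare { } block
instead, the semicolon after the macro call would end the if statement
and the else branch would fail to compile.

```c
/*
 * Hypothetical two-statement macro wrapped in do { } while (0) so that
 * it expands to exactly one statement.
 */
#define CLAMP_AND_COUNT(val, max, hits) do {	\
	if ((val) > (max)) {			\
		(val) = (max);			\
		(hits)++;			\
	}					\
} while (/* CONSTCOND */ 0)

/* Clamp a window to max, counting clamps; negative input becomes 0. */
int
clamp_window(int win, int max, int *hits)
{
	if (win >= 0)
		CLAMP_AND_COUNT(win, max, *hits);	/* one statement */
	else
		win = 0;
	return win;
}
```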


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
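
A rough sketch of fixed-point RTT smoothing with 5 fraction bits, as
mentioned in item 4 above. The function names, the macro names, and the
1/8 smoothing gain are assumptions made for illustration; this is not
the kernel's tcp_xmit_timer() code.

```c
#include <stdint.h>

#define SRTT_FRAC_BITS	5	/* fraction bits, per the commit message */
#define SRTT_GAIN_SHIFT	3	/* assumed smoothing gain of 1/8 */

/* Fold one RTT sample (in ticks) into the scaled smoothed RTT. */
int32_t
srtt_update(int32_t srtt_fp, int32_t rtt)
{
	/* Scale the sample to the same fixed-point format first. */
	int32_t delta = rtt * (1 << SRTT_FRAC_BITS) - srtt_fp;

	return srtt_fp + delta / (1 << SRTT_GAIN_SHIFT);
}

/* Readers must shift the fraction away before using the value. */
int32_t
srtt_value(int32_t srtt_fp)
{
	return srtt_fp / (1 << SRTT_FRAC_BITS);
}
```

Forgetting the conversion in srtt_value() is the kind of missing shift
that item 3 refers to: the raw field would be 32 times too large.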


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.161 07-Nov-2022 yasuoka

Modify TCP receive buffer size auto scaling to use the smoothed RTT
(SRTT) instead of the timestamp option. Since the timestamp option is
disabled on some OSs (eg. Windows) or dropped by some
firewalls/routers, in such a case the window size had been fixed at
16KB, this limits throughput at very low on high latency networks.
Also replace "tcp_now" from 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT better.

tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
on pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets returns EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of make
pru_sockaddr handler optional and return EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
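
The NULL-handler convention described above can be sketched as follows.
The structure, the wrapper, and the demo handler are simplified
stand-ins invented for this illustration; the real pr_usrreqs layout
and pru_bind() signature differ.

```c
#include <errno.h>
#include <stddef.h>

struct sock;	/* opaque stand-in for struct socket */

/* Hypothetical handler table; NULL means "request not supported". */
struct usrreqs {
	int (*bind)(struct sock *, int port);
};

/* Wrapper does the NULL check so callers never see a NULL handler. */
int
do_bind(const struct usrreqs *ur, struct sock *so, int port)
{
	if (ur->bind == NULL)
		return EOPNOTSUPP;
	return (*ur->bind)(so, port);
}

/* Demo handler for a protocol that does support bind. */
int
demo_bind(struct sock *so, int port)
{
	(void)so;
	return port > 0 ? 0 : EINVAL;
}
```

Keeping the check in one wrapper means protocols without the request
need no dummy handler that only returns an error.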


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
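
The active/passive scheme can be sketched roughly as below. All names
are hypothetical, the hash buckets themselves are omitted, and the new
seed is passed in by the caller instead of being drawn from a random
source; this is only a model of the switching rule, not the code in
tcp_input.c.

```c
#include <stdint.h>

#define SYN_USE_LIMIT 100000	/* switch threshold from the commit message */

struct syn_set {
	int count;		/* entries currently in this set */
	int uses;		/* insertions since the last reseed */
	uint32_t seed;		/* hash seed for this set */
};

struct syn_cache {
	struct syn_set set[2];
	int active;		/* index of the set receiving new entries */
};

/* Add one entry; returns 1 if the sets switched roles first. */
int
syn_insert(struct syn_cache *sc, uint32_t new_seed)
{
	struct syn_set *act = &sc->set[sc->active];
	struct syn_set *pas = &sc->set[!sc->active];
	int switched = 0;

	/* Switch only when the passive set has fully idled out. */
	if (act->uses >= SYN_USE_LIMIT && pas->count == 0) {
		sc->active = !sc->active;
		act = pas;
		act->seed = new_seed;
		act->uses = 0;
		switched = 1;
	}
	act->count++;
	act->uses++;
	return switched;
}

/* An entry of set idx timed out or completed its handshake. */
void
syn_remove(struct syn_cache *sc, int idx)
{
	if (sc->set[idx].count > 0)
		sc->set[idx].count--;
}
```

Because new entries only ever go to the active set, periodic unanswered
SYNs cannot keep the passive set populated, so it eventually empties and
the reseed can happen.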


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
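
As an illustration of the idiom named in 1.90 (a sketch only; the macro and function below are hypothetical and not taken from tcp_var.h): wrapping a multi-statement macro in do { } while (0) makes it expand to a single statement, so it stays correct inside an unbraced if/else.

```c
#include <assert.h>

/*
 * Hypothetical multi-statement macro, not from tcp_var.h.
 * The do { } while (0) wrapper turns the body into one statement
 * that still requires a trailing semicolon at the call site.
 */
#define DEMO_COUNT_AND_CLEAR(counter, flag) do {	\
	(counter)++;					\
	(flag) = 0;					\
} while (/* CONSTCOND */ 0)

/* Returns the counter after one conditional macro invocation. */
static int
demo_idiom(int cond)
{
	int counter = 0, flag = 1;

	if (cond)
		DEMO_COUNT_AND_CLEAR(counter, flag);	/* one statement */
	else
		counter = -1;		/* else still pairs with the if */
	(void)flag;
	return counter;
}
```

Without the wrapper, a plain { } block followed by the caller's semicolon would end the if statement early and leave the else without a matching if.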


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.160 17-Oct-2022 mvs

Change pru_abort() return type to the type of void and make pru_abort()
optional.

We have no interest on pru_abort() return value. We call it only from
soabort() which is dummy pru_abort() wrapper and has no return value.

Only the connection oriented sockets need to implement (*pru_abort)()
handler. Such sockets are tcp(4) and unix(4) sockets, so remove existing
code for all others, it is never called.

ok guenther@


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
on pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol have PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets returns EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of make
pru_sockaddr handler optional and return EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it
is required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
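
The sign-bit problem behind 1.137 can be illustrated as follows. This is a sketch under assumptions: the DEMO_* constants are hypothetical values chosen so that the shift reaches bit 31, they are not copied from tcp_var.h, and a 32-bit int is assumed.

```c
#include <assert.h>

/* Hypothetical values for illustration; not copied from tcp_var.h. */
#define DEMO_TF_TIMER	0x04000000U	/* unsigned flag constant */
#define DEMO_TIMER_IDX	5		/* timer index that reaches bit 31 */

/*
 * With an unsigned constant the shift is well defined and yields
 * bit 31. The same shift on a signed int constant
 * (0x04000000 << 5) would move a one into the sign bit, which is
 * undefined behaviour in C and is what a sanitizer reports.
 */
static unsigned int
demo_timer_flag(unsigned int idx)
{
	return DEMO_TF_TIMER << idx;
}
```

Since the result is stored in an unsigned field anyway, declaring the constants with a U suffix keeps the whole expression in unsigned arithmetic.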


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timer.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that require a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow adjusting tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent with the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
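
The role switch described in this commit can be sketched roughly as follows. This is a minimal userland model under stated assumptions, not the kernel code: `struct syn_set`, `struct syn_cache_model`, `model_init()`, `model_insert()`, `model_remove()` and `USE_LIMIT` are hypothetical names, and the random reseed is replaced by a simple counter.

```c
#include <stdint.h>

/* Hypothetical userland model of the two-set scheme; not the kernel structs. */
#define USE_LIMIT 100000	/* uses of the active set before a switch is due */

struct syn_set {
	int	 count;		/* entries currently stored */
	int	 use;		/* remaining uses before a reseed is wanted */
	uint32_t seed;		/* hash seed for this set */
};

struct syn_cache_model {
	struct syn_set	 set[2];
	int		 active;	/* index of the set receiving new entries */
	uint32_t	 next_seed;	/* stand-in for a real random source */
};

void
model_init(struct syn_cache_model *m)
{
	m->set[0] = (struct syn_set){ 0, USE_LIMIT, 1 };
	m->set[1] = (struct syn_set){ 0, USE_LIMIT, 2 };
	m->active = 0;
	m->next_seed = 3;
}

/* Insert into the active set, switching roles first if that is allowed. */
void
model_insert(struct syn_cache_model *m)
{
	struct syn_set *act = &m->set[m->active];
	struct syn_set *pas = &m->set[!m->active];

	if (act->use <= 0 && pas->count == 0) {
		/* Passive set has idled out and active is used up: swap. */
		pas->seed = m->next_seed++;	/* "reseed" the new active set */
		pas->use = USE_LIMIT;
		m->active = !m->active;
		act = pas;
	}
	act->count++;
	act->use--;
}

/* Remove one entry from the given set, e.g. on timeout or completion. */
void
model_remove(struct syn_cache_model *m, int idx)
{
	if (m->set[idx].count > 0)
		m->set[idx].count--;
}
```

In this model an attacker who keeps the old set non-empty can no longer block reseeding, because new entries go to the other set while the old one drains.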


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows writing relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similarly
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows running isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
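
For readers unfamiliar with the idiom named in this commit, here is a small illustration. `CLAMP_MIN` and `clamp_or_negate()` are hypothetical and are not macros or functions from tcp_var.h; the sketch only shows why a multi-statement macro is wrapped in `do { } while (0)`.

```c
/* Hypothetical macro for illustration; not one from tcp_var.h. */
#define CLAMP_MIN(v, lo) do {			\
	if ((v) < (lo))				\
		(v) = (lo);			\
} while (/* CONSTCOND */ 0)

int
clamp_or_negate(int x, int use_clamp)
{
	/*
	 * Because the macro expands to a single statement that needs the
	 * trailing semicolon, it composes with if/else as a function call
	 * would.  A bare { } block or a bare `if' in the macro would either
	 * fail to parse here or attach the `else' to the wrong `if'.
	 */
	if (use_clamp)
		CLAMP_MIN(x, 10);
	else
		x = -x;
	return x;
}
```

The `/* CONSTCOND */` comment is a lint hint that the constant condition is intentional.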


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be of an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.159 03-Oct-2022 bluhm

System calls should not fail due to temporary memory shortage in
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During syscall socket(2)
it is ok to wait, this logic was missing for internet pcb. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@


Revision tags: OPENBSD_7_2_BASE
# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to void. We have no interest
in pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets return EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of making
pru_sockaddr handler optional and returning EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT is used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot more specialized info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that might allow attacking a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
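
A small sketch of the problem class fixed here. The constants `MODEL_TF_TIMER` and `MODEL_T_DELACK` are hypothetical values chosen so that the shift reaches bit 31; they are not claimed to match the definitions in tcp_var.h.

```c
#include <stdint.h>

#define MODEL_TF_TIMER	0x10000000U	/* hypothetical base bit for timer flags */
#define MODEL_T_DELACK	3		/* hypothetical index of the last timer */

/*
 * Return the flag bit for timer index `idx'.  With the U suffix the shift
 * is done in unsigned arithmetic, so reaching bit 31 is well defined.
 * Shifting a signed int constant into the sign bit would be undefined
 * behaviour, which is the kind of issue a sanitizer reports.
 */
uint32_t
timer_flag(unsigned int idx)
{
	return MODEL_TF_TIMER << idx;
}
```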


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delayed ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
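
The armed-flag idea from this commit can be modelled in a few lines. This is a single-threaded sketch under stated assumptions: `struct timer_model`, `timer_arm()`, `timer_cancel()` and `timer_handler()` are hypothetical names, and the real code additionally deals with the netlock and the timeout(9) API.

```c
#include <stdbool.h>

/* Hypothetical model of one timer with an explicit armed flag. */
struct timer_model {
	bool	armed;		/* set on arm, cleared on cancel */
	int	fired;		/* how many times the handler did real work */
};

void
timer_arm(struct timer_model *t)
{
	t->armed = true;
}

void
timer_cancel(struct timer_model *t)
{
	t->armed = false;
}

/* Handler body: may run late, after the timer was already cancelled. */
void
timer_handler(struct timer_model *t)
{
	if (!t->armed)
		return;		/* cancelled while waiting for the lock */
	t->armed = false;
	t->fired++;
}
```

The point of the flag is that a handler which was already dequeued and is sleeping on a lock can still notice the cancellation once it finally runs.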


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
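
As a reminder of why the idiom from revision 1.90 matters, here is a small sketch. The two macros are hypothetical and do not come from tcp_var.h; they only show how a multi-statement macro behaves with and without the wrapper.

```c
#include <assert.h>

/* Hypothetical two-statement macros for illustration only. */
#define EX_RESET_BAD(a, b)	a = 0; b = 0
#define EX_RESET(a, b) do {						\
	(a) = 0;							\
	(b) = 0;							\
} while (/* CONSTCOND */ 0)

static int
ex_bad(int cond)
{
	int x = 1, y = 1;

	if (cond)
		EX_RESET_BAD(x, y);	/* only "x = 0" is guarded by the if */
	return x + y;
}

static int
ex_good(int cond)
{
	int x = 1, y = 1;

	if (cond)
		EX_RESET(x, y);		/* the whole body is one statement */
	else
		x = 2;
	return x + y;
}

static void
ex_demo(void)
{
	assert(ex_bad(0) == 1);		/* y was cleared although cond is 0 */
	assert(ex_bad(1) == 0);
	assert(ex_good(0) == 3);	/* nothing cleared, else branch ran */
	assert(ex_good(1) == 0);
}
```

Without the wrapper, the `else` branch in `ex_good()` would not even compile, and the unguarded second statement in `ex_bad()` runs regardless of the condition.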


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
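
The "5 bits of fraction" mentioned in item 4 of revision 1.70 can be illustrated with a small fixed-point sketch. It assumes the classic smoothing rule srtt = 7/8 * srtt + 1/8 * rtt and hypothetical `EX_` shift constants (3 bits for the gain plus 2 base bits); it is not claimed to match tcp_xmit_timer() line for line.

```c
#include <assert.h>

#define EX_RTT_SHIFT		3	/* smoothing gain of 1/8 */
#define EX_RTT_BASE_SHIFT	2	/* extra scaling bits */
/* Together the smoothed value carries 3 + 2 = 5 bits of fraction. */

/*
 * Fold one RTT sample into a smoothed RTT that is stored scaled by 32.
 * In real arithmetic this is srtt = 7/8 * srtt + 1/8 * rtt.
 */
static int
ex_srtt_update(int srtt_scaled, int rtt)
{
	int delta;

	assert(srtt_scaled > 0 && rtt >= 0);
	delta = (rtt << EX_RTT_BASE_SHIFT) - (srtt_scaled >> EX_RTT_SHIFT);
	srtt_scaled += delta;
	if (srtt_scaled <= 0)
		srtt_scaled = 1 << EX_RTT_BASE_SHIFT;
	return srtt_scaled;
}

/* Convert the scaled value back to whole units, dropping the fraction. */
static int
ex_srtt_value(int srtt_scaled)
{
	return srtt_scaled >> (EX_RTT_SHIFT + EX_RTT_BASE_SHIFT);
}
```

Starting from a smoothed value of 100 (stored as 3200), a sample of 180 gives 3520, which is 110 after scaling back, the same as 7/8 * 100 + 1/8 * 180.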


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.
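
A per-second limit of the kind described in revision 1.33 could be sketched as below. The structure, the function name, and the exact boundary behaviour (allowing up to `limit` events in one second) are assumptions for illustration and are not taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical limiter state. */
struct ex_ratelimit {
	time_t	window;		/* the second currently being counted */
	int	count;		/* events already allowed in that second */
};

/*
 * Returns true if one more event may be sent at time `now`, given a
 * limit of `limit` events per second.
 */
static bool
ex_rate_ok(struct ex_ratelimit *rl, time_t now, int limit)
{
	assert(rl != NULL && limit >= 0);
	if (now != rl->window) {
		/* a new second has started: restart the count */
		rl->window = now;
		rl->count = 0;
	}
	if (rl->count >= limit)
		return false;
	rl->count++;
	return true;
}
```

A caller would consult `ex_rate_ok()` before sending each RST and silently drop the response when it returns false.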


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.
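
A port bitmask like the one mentioned in revision 1.8 can be sketched as follows. The type and function names are hypothetical, and the layout (one bit per port in 32-bit words) is an assumption, not the kernel's actual representation.

```c
#include <assert.h>
#include <stdint.h>

#define EX_NPORTS	65536
#define EX_WORD_BITS	32

/* One bit per port; a set bit means "never allocate dynamically". */
struct ex_portmask {
	uint32_t bits[EX_NPORTS / EX_WORD_BITS];
};

static void
ex_port_reserve(struct ex_portmask *pm, unsigned int port)
{
	assert(port < EX_NPORTS);
	pm->bits[port / EX_WORD_BITS] |= UINT32_C(1) << (port % EX_WORD_BITS);
}

static int
ex_port_is_reserved(const struct ex_portmask *pm, unsigned int port)
{
	assert(port < EX_NPORTS);
	return (pm->bits[port / EX_WORD_BITS] >>
	    (port % EX_WORD_BITS)) & 1U;
}

/* First port in [lo, hi] that is not reserved, or -1 if there is none. */
static int
ex_port_pick(const struct ex_portmask *pm, unsigned int lo, unsigned int hi)
{
	unsigned int p;

	assert(lo <= hi && hi < EX_NPORTS);
	for (p = lo; p <= hi; p++)
		if (!ex_port_is_reserved(pm, p))
			return (int)p;
	return -1;
}
```

The whole mask for 65536 ports fits in 8 KB, and a lookup is a single shift and mask, which is why a bitmask is a convenient form for a list that is edited rarely and consulted often.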


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.158 13-Sep-2022 mvs

Change pru_rcvd() return type to the type of void. We have no interest
on pru_rcvd() return value.

Drop "pru_rcvd != NULL" check within pru_rcvd() wrapper. We only call it
if the socket's protocol has PR_WANTRCVD flag set. Such sockets are
route domain, tcp(4) and unix(4) sockets.

ok guenther@ bluhm@


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets return EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of making
pru_sockaddr handler optional and returning EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
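
The NULL check described in revision 1.142 amounts to a thin wrapper around an optional function pointer. The sketch below uses a hypothetical, much reduced handler table; the real structure and prototypes are not shown in this log and may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ex_socket;	/* opaque for this sketch */

/* Hypothetical handler table with a single optional entry. */
struct ex_usrreqs {
	int (*pru_bind)(struct ex_socket *, const void *);
};

/* Wrapper: protocols without a bind handler get EOPNOTSUPP. */
static int
ex_pru_bind(const struct ex_usrreqs *ur, struct ex_socket *so,
    const void *addr)
{
	assert(ur != NULL);
	if (ur->pru_bind == NULL)
		return EOPNOTSUPP;
	return (*ur->pru_bind)(so, addr);
}

/* Example handler that accepts every request. */
static int
ex_bind_ok(struct ex_socket *so, const void *addr)
{
	(void)so;
	(void)addr;
	return 0;
}
```

Keeping the check inside the wrapper means a protocol that lacks a handler simply leaves the pointer NULL instead of providing a stub.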


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
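
The sign-bit problem from revision 1.137 can be shown with a short sketch. The constants are illustrative assumptions (a base flag of 0x04000000 and a highest timer index of 5) and the `EX_` names are hypothetical; the sketch also assumes a 32-bit `unsigned int`.

```c
#include <assert.h>

/* Illustrative values: a base timer flag and the index of the last timer. */
#define EX_TF_TIMER	0x04000000U	/* unsigned constant */
#define EX_T_DELACK	5

/* Flag bit for timer number t, computed by shifting the base flag. */
static unsigned int
ex_timer_flag(int t)
{
	assert(t >= 0 && t <= EX_T_DELACK);
	/*
	 * With a signed constant, 0x04000000 << 5 would shift into the
	 * sign bit of a 32-bit int, which is undefined behaviour in C.
	 * The U suffix keeps the whole computation unsigned.
	 */
	return EX_TF_TIMER << t;
}
```

With these values the last timer maps to 0x80000000, exactly the bit that a signed 32-bit shift cannot represent.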


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
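
The per-timer flag from 1.130 can be sketched as follows. This is a
simplified userland illustration under stated assumptions: struct ex_tcpcb,
ex_timer_arm(), ex_timer_cancel() and ex_timer_run() are hypothetical names,
and the lock that a real handler would sleep on is only indicated by a
comment.

```c
#include <assert.h>
#include <stdbool.h>

#define EX_NTIMERS 4

/* Hypothetical control block: one "armed" bit and one fire counter per timer. */
struct ex_tcpcb {
	unsigned int armed;		/* bit n set: timer n is armed */
	int fired[EX_NTIMERS];		/* how often timer n really ran */
};

static void
ex_timer_arm(struct ex_tcpcb *tp, int n)
{
	assert(n >= 0 && n < EX_NTIMERS);
	tp->armed |= 1U << n;
}

static void
ex_timer_cancel(struct ex_tcpcb *tp, int n)
{
	assert(n >= 0 && n < EX_NTIMERS);
	tp->armed &= ~(1U << n);
}

/*
 * Timeout handler.  It may start running after the timer was canceled,
 * because it can sleep while waiting for a lock.  The armed bit is
 * therefore checked again before any work is done.
 */
static bool
ex_timer_run(struct ex_tcpcb *tp, int n)
{
	assert(n >= 0 && n < EX_NTIMERS);
	/* a real handler would acquire the lock here and might sleep */
	if ((tp->armed & (1U << n)) == 0)
		return false;		/* canceled in the meantime */
	tp->armed &= ~(1U << n);
	tp->fired[n]++;
	return true;
}
```

The design point is that deleting the timeout alone is not enough once the
handler can sleep; the flag makes a late handler a no-op.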


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows simplifying code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow adjusting tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent with the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
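
The role switch described in 1.111 can be sketched roughly as below. The
structures and the function ex_syn_cache_insert() are hypothetical and much
simpler than the kernel code; only the limit of 100000 uses and the
"passive set is empty" condition come from the log entry. The new seed is
passed in by the caller instead of being drawn from a random source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_SYN_USE_LIMIT 100000	/* use count named in the log entry */

struct ex_syn_set {
	int      count;		/* entries currently stored */
	int      uses;		/* insertions since the last reseed */
	uint32_t seed;		/* hash seed for this set */
};

struct ex_syn_cache {
	struct ex_syn_set set[2];
	int active;		/* index of the set that receives new entries */
};

/*
 * Account for one new entry and return the index of the set it went into.
 * When the passive set has drained and the active set has been used often
 * enough, the passive set is reseeded and becomes the active one.
 */
static int
ex_syn_cache_insert(struct ex_syn_cache *sc, uint32_t new_seed)
{
	assert(sc != NULL);
	struct ex_syn_set *act = &sc->set[sc->active];
	struct ex_syn_set *pas = &sc->set[!sc->active];

	if (pas->count == 0 && act->uses >= EX_SYN_USE_LIMIT) {
		pas->seed = new_seed;	/* reseed only while the set is empty */
		pas->uses = 0;
		sc->active = !sc->active;
		act = pas;
	}
	act->count++;
	act->uses++;
	return sc->active;
}
```

Because only the idle set is reseeded, an attacker who keeps the active set
non-empty can no longer block reseeding indefinitely.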


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows writing relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similarly
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows running isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
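
For readers unfamiliar with the idiom named in 1.90, here is a short
illustration. EX_SWAP_INT and ex_order() are made-up names for this sketch.
The point is that a multi-statement macro wrapped this way expands to a
single statement, so it can be used in an unbraced if/else.

```c
#include <assert.h>
#include <stddef.h>

/* Multi-statement macro wrapped so that it behaves like one statement. */
#define EX_SWAP_INT(a, b) do {		\
	int ex_tmp_ = (a);		\
	(a) = (b);			\
	(b) = ex_tmp_;			\
} while (/* CONSTCOND */ 0)

/* Put *lo and *hi in ascending order; return 1 if they were swapped. */
static int
ex_order(int *lo, int *hi)
{
	assert(lo != NULL && hi != NULL);
	if (*lo > *hi)
		EX_SWAP_INT(*lo, *hi);	/* one statement, needs the ';' */
	else
		return 0;		/* the else still pairs with the if */
	return 1;
}
```

Without the do/while wrapper, the trailing semicolon after a braced macro
body would end the if statement and the else would fail to compile.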


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun
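
As a reminder of what RFC 3390 specifies, the initial window is bounded by
min(4*MSS, max(2*MSS, 4380 bytes)). The helper below is a standalone sketch
of that formula; the name ex_rfc3390_initial_window() is hypothetical and
the kernel code may compute the value differently.

```c
#include <assert.h>

/*
 * Upper bound on the initial congestion window in bytes, following the
 * formula in RFC 3390: min(4*MSS, max(2*MSS, 4380)).
 */
static unsigned int
ex_rfc3390_initial_window(unsigned int mss)
{
	assert(mss > 0);
	unsigned int upper = 4 * mss;
	unsigned int lower = 2 * mss;

	if (lower < 4380)
		lower = 4380;
	return upper < lower ? upper : lower;
}
```

For a typical Ethernet MSS of 1460 bytes this gives 4380 bytes, i.e. three
full segments.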


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.157 03-Sep-2022 mvs

Move PRU_PEERADDR request to (*pru_peeraddr)().

Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets,
except tcp(4) case.

Also remove *_usrreq() handlers.

ok bluhm@


# 1.156 03-Sep-2022 bluhm

Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@


# 1.155 03-Sep-2022 mvs

Move PRU_SOCKADDR request to (*pru_sockaddr)()

Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4)
inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability.

The key management and route domain sockets return EINVAL error for
PRU_SOCKADDR request, so keep this behaviour for a while instead of making
pru_sockaddr handler optional and returning EOPNOTSUPP.

ok bluhm@


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT is used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it is
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
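
The NULL-handler convention from 1.142 can be sketched like this. All
identifiers (struct ex_usrreqs, ex_pru_bind(), ex_tcp_bind()) are
hypothetical stand-ins with simplified signatures; the real pr_usrreqs
handlers take different arguments.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ex_socket;

/* Table of per-protocol handlers; a NULL entry means "not supported". */
struct ex_usrreqs {
	int (*eu_bind)(struct ex_socket *, int port);
};

struct ex_socket {
	const struct ex_usrreqs *so_usrreqs;
	int so_port;
};

/* Wrapper: do the NULL check once here instead of in every protocol. */
static int
ex_pru_bind(struct ex_socket *so, int port)
{
	assert(so != NULL && so->so_usrreqs != NULL);
	if (so->so_usrreqs->eu_bind == NULL)
		return EOPNOTSUPP;
	return (*so->so_usrreqs->eu_bind)(so, port);
}

/* Example handler for a protocol that supports bind. */
static int
ex_tcp_bind(struct ex_socket *so, int port)
{
	so->so_port = port;
	return 0;
}

static const struct ex_usrreqs ex_tcp_usrreqs = { .eu_bind = ex_tcp_bind };
static const struct ex_usrreqs ex_nobind_usrreqs = { .eu_bind = NULL };
```

Protocols that do not support a request simply leave the pointer NULL, so no
dummy handlers returning EOPNOTSUPP are needed.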


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot more specialized info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
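
A small sketch of why the unsigned constants matter. The values below are illustrative assumptions, not copied from tcp_var.h; the point is only that a flag shifted into bit 31 stays well defined when the constant is unsigned, whereas the same shift on a signed int would reach the sign bit. It assumes a 32-bit unsigned int.

```c
/* Illustrative values only; the real constants live in tcp_var.h. */
#define SKETCH_TF_TIMER		0x04000000U	/* hypothetical base timer bit */
#define SKETCH_TCPT_DELACK	5		/* hypothetical timer index */

/*
 * Compute the per-timer flag bit. Because SKETCH_TF_TIMER is unsigned,
 * shifting it by 5 yields 0x80000000U without signed overflow.
 */
static unsigned int
sketch_timer_flag(unsigned int timer)
{
	return SKETCH_TF_TIMER << timer;
}
```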


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
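
The per-timer flag described above can be sketched roughly like this. All names are hypothetical and the locking is omitted; the sketch only shows a handler that may still run after cancellation checking an "armed" bit before doing any work.

```c
/* Hypothetical, simplified control block for illustration. */
struct sketch_tcpcb {
	unsigned int t_flags;	/* one "armed" bit per timer */
	int fired;		/* counts how often a handler did real work */
};

#define SKETCH_TF_TIMER	0x1U

static void
sketch_timer_arm(struct sketch_tcpcb *tp, unsigned int timer)
{
	tp->t_flags |= SKETCH_TF_TIMER << timer;
}

static void
sketch_timer_cancel(struct sketch_tcpcb *tp, unsigned int timer)
{
	tp->t_flags &= ~(SKETCH_TF_TIMER << timer);
}

/*
 * Timer handler: it may be entered even though the timer was canceled
 * in the meantime, so it checks the flag first and returns early.
 */
static void
sketch_timer_handler(struct sketch_tcpcb *tp, unsigned int timer)
{
	if ((tp->t_flags & (SKETCH_TF_TIMER << timer)) == 0)
		return;
	tp->t_flags &= ~(SKETCH_TF_TIMER << timer);
	tp->fired++;
}
```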


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
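
The active/passive scheme above might look roughly like the following. This is a sketch under stated assumptions: the structures, the counting direction of the use counter and the way the fresh seed is passed in are all hypothetical; only the switching rule (passive set empty and active set used 100000 times) is taken from the commit message.

```c
#define SKETCH_USE_LIMIT 100000

/* Hypothetical per-set state: entries held, insertions seen, hash seed. */
struct sketch_syn_set {
	int count;
	int uses;
	unsigned int seed;
};

struct sketch_syn_cache {
	struct sketch_syn_set set[2];
	int active;		/* index of the set receiving new entries */
};

/*
 * Account for one new SYN cache entry. If the passive set has drained
 * and the active set reached the use limit, the sets switch roles and
 * the newly active set gets a fresh seed (supplied by the caller here).
 */
static void
sketch_syn_insert(struct sketch_syn_cache *sc, unsigned int fresh_seed)
{
	struct sketch_syn_set *act = &sc->set[sc->active];
	struct sketch_syn_set *pas = &sc->set[!sc->active];

	if (pas->count == 0 && act->uses >= SKETCH_USE_LIMIT) {
		sc->active = !sc->active;
		act = &sc->set[sc->active];
		act->seed = fresh_seed;
		act->uses = 0;
	}
	act->count++;
	act->uses++;
}
```

Because old entries only ever drain from the passive set, an attacker who keeps the active set non-empty can no longer postpone the reseed indefinitely.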


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Test by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
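
For readers unfamiliar with the idiom, a short sketch of why multi-statement macros are wrapped this way. SKETCH_SWAP and sketch_order() are made-up names for illustration and are not macros from tcp_var.h.

```c
/*
 * A multi-statement macro wrapped in do { } while (0) expands to a
 * single statement, so it is safe in an unbraced if/else.
 */
#define SKETCH_SWAP(a, b) do {						\
	int tmp_ = (a);							\
	(a) = (b);							\
	(b) = tmp_;							\
} while (/* CONSTCOND */ 0)

/* Put the larger value first; returns 1 if a swap happened. */
static int
sketch_order(int *x, int *y)
{
	if (*x < *y)
		SKETCH_SWAP(*x, *y);
	else
		return 0;
	return 1;
}
```

With a plain `{ ... }` body, the semicolon after the macro call would end the if statement and the `else` would no longer compile.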


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.
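
A port bitmask of this kind can be sketched as below. The array and helper names are hypothetical and do not claim to match the kernel's macros; the sketch only shows one bit per port number with set and test operations.

```c
#include <stdint.h>

#define SKETCH_NPORTS	65536
#define SKETCH_MAPBITS	32

/* One bit per port; a set bit means "do not allocate dynamically". */
static uint32_t sketch_baddynamic[SKETCH_NPORTS / SKETCH_MAPBITS];

static void
sketch_port_set(uint32_t *map, unsigned int port)
{
	map[port / SKETCH_MAPBITS] |= UINT32_C(1) << (port % SKETCH_MAPBITS);
}

static int
sketch_port_isset(const uint32_t *map, unsigned int port)
{
	return (map[port / SKETCH_MAPBITS] >>
	    (port % SKETCH_MAPBITS)) & 1;
}
```

A full 16-bit port space fits in 8 KB this way, and a lookup is a single shift and mask.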


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.154 02-Sep-2022 mvs

Move PRU_CONTROL request to (*pru_control)().

The 'proc *' arg is not used for PRU_CONTROL request, so remove it from
pru_control() wrapper.

Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.

ok guenther@ bluhm@


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it
required only for tcp(4) and unix(4) sockets, so i should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For protocols which don't support the request, leave the handler NULL. Do the
NULL check within the corresponding pru_() wrapper and return EOPNOTSUPP in
that case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
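
The NULL-handler convention described in this entry can be sketched as
follows. This is a minimal illustration, not the kernel code: every name
prefixed with sketch_ is hypothetical, and the structures are reduced to
the fields the example needs.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sketch_socket;

struct sketch_usrreqs {
	/* NULL means the protocol does not support bind. */
	int (*pru_bind)(struct sketch_socket *, const void *);
};

struct sketch_socket {
	const struct sketch_usrreqs *so_usrreqs;
	int so_bound;
};

/* Wrapper: do the NULL check once here instead of in every protocol. */
static int
sketch_pru_bind(struct sketch_socket *so, const void *addr)
{
	if (so->so_usrreqs->pru_bind == NULL)
		return (EOPNOTSUPP);
	return ((*so->so_usrreqs->pru_bind)(so, addr));
}

/* A protocol that implements bind. */
static int
sketch_tcp_bind(struct sketch_socket *so, const void *addr)
{
	(void)addr;
	so->so_bound = 1;
	return (0);
}

static const struct sketch_usrreqs sketch_tcp_usrreqs = {
	.pru_bind = sketch_tcp_bind,
};

/* A protocol that leaves the handler NULL. */
static const struct sketch_usrreqs sketch_nobind_usrreqs = {
	.pru_bind = NULL,
};
```

With this layout a protocol without bind support needs no dummy handler;
the caller sees EOPNOTSUPP from the wrapper.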


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
much more specialized info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to avoid sharing
data with userland that could help an attacker target a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
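
The sign-bit problem mentioned in this entry can be illustrated with a
small sketch. The constant values and the SK_/sk_ names are assumptions
chosen for the example; the real definitions are in tcp_var.h and may
differ.

```c
#include <stdint.h>

/* Illustrative values only, not copied from tcp_var.h. */
#define SK_TF_TIMER	0x04000000U	/* base bit for timer flags, unsigned */
#define SK_TCPT_DELACK	5		/* index of the last timer */

/* Flag bit that marks the given timer as armed. */
static inline uint32_t
sk_timer_flag(unsigned int timer)
{
	/*
	 * With a signed constant (0x04000000 without the U suffix) a
	 * shift by 5 would move a 1 into bit 31 of a 32-bit int, which
	 * is undefined behaviour in C. The unsigned constant makes the
	 * shift well defined.
	 */
	return (SK_TF_TIMER << timer);
}
```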


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
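
The per-timer flag described in this entry might look roughly like the
sketch below. It assumes a single timer and hypothetical sk_ names, and
it leaves out the timeout and locking calls; the point is only that the
handler re-checks the armed flag once it holds the lock.

```c
#include <stdbool.h>

#define SK_TF_TMR_REXMT	0x1U	/* hypothetical "rexmt timer armed" bit */

struct sk_tcpcb {
	unsigned int	t_flags;
	int		rexmt_runs;	/* how often the handler did work */
};

static void
sk_timer_arm(struct sk_tcpcb *tp)
{
	tp->t_flags |= SK_TF_TMR_REXMT;
	/* The real code would also schedule the timeout here. */
}

static void
sk_timer_cancel(struct sk_tcpcb *tp)
{
	tp->t_flags &= ~SK_TF_TMR_REXMT;
	/*
	 * Deleting the timeout alone may come too late if the handler
	 * is already sleeping on the lock, hence the flag.
	 */
}

/* Timer handler body, entered after the lock has been taken. */
static bool
sk_timer_rexmt(struct sk_tcpcb *tp)
{
	if ((tp->t_flags & SK_TF_TMR_REXMT) == 0)
		return (false);		/* canceled while we slept */
	tp->t_flags &= ~SK_TF_TMR_REXMT;
	tp->rexmt_runs++;
	return (true);
}
```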


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
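
The active/passive scheme described in this entry can be sketched as
below. This is a simplified model under stated assumptions: the sk_
names are hypothetical, the hash table itself is omitted, and the
caller passes in a fresh random value instead of the kernel drawing
one itself.

```c
#include <stdint.h>

struct sk_syn_set {
	int		count;	/* entries currently stored */
	int		use;	/* remaining uses before a switch */
	uint32_t	seed;	/* hash seed for this set */
};

struct sk_syn_cache {
	struct sk_syn_set	set[2];
	int			active;		/* set taking new entries */
	int			use_limit;	/* e.g. 100000 */
};

static void
sk_init(struct sk_syn_cache *sc, int use_limit, uint32_t seed)
{
	sc->set[0] = (struct sk_syn_set){ 0, use_limit, seed };
	sc->set[1] = (struct sk_syn_set){ 0, 0, 0 };
	sc->active = 0;
	sc->use_limit = use_limit;
}

/*
 * Account for one new entry and return the set it belongs to.
 * fresh_seed is only consumed when the sets switch roles.
 */
static struct sk_syn_set *
sk_insert(struct sk_syn_cache *sc, uint32_t fresh_seed)
{
	struct sk_syn_set *act = &sc->set[sc->active];
	struct sk_syn_set *pas = &sc->set[!sc->active];

	if (act->use <= 0 && pas->count == 0) {
		/* Passive set has idled out: switch roles and reseed. */
		sc->active = !sc->active;
		act = &sc->set[sc->active];
		act->seed = fresh_seed;
		act->use = sc->use_limit;
	}
	act->use--;
	act->count++;
	return (act);
}

/* An entry timed out or completed the handshake. */
static void
sk_remove(struct sk_syn_set *set)
{
	set->count--;
}
```

Because new entries only ever go to the active set, an attacker who
keeps sending SYN packets cannot stop the passive set from draining,
so a reseed eventually happens.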


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoid traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.153 31-Aug-2022 mvs

Move PRU_SENDOOB request to (*pru_sendoob)().

PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To
avoid dummy m_freem(9) handlers for all protocols release passed mbufs
in the pru_sendoob() EOPNOTSUPP error path.

Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path.

ok bluhm@


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support PRU_ABORT request, but actually it
required only for tcp(4) and unix(4) sockets, so i should be optional.
However, they will be removed with separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
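
The NULL-handler convention described in this entry can be sketched in a few lines of C. This is only an illustration under simplifying assumptions: `struct socket` is reduced to one field, the handler takes a plain pointer instead of the kernel's real arguments, and `demo_bind`, `demo_usrreqs` and `nobind_usrreqs` are hypothetical names that do not appear in the tree.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the kernel socket. */
struct socket {
	int so_bound;
};

struct pr_usrreqs {
	/* NULL means the protocol does not support bind. */
	int (*pru_bind)(struct socket *, const void *);
};

/* Wrapper: callers never test the handler pointer themselves. */
static int
pru_bind(const struct pr_usrreqs *ur, struct socket *so, const void *addr)
{
	if (ur->pru_bind == NULL)
		return EOPNOTSUPP;
	return (*ur->pru_bind)(so, addr);
}

/* Example handler for a protocol that does support bind. */
static int
demo_bind(struct socket *so, const void *addr)
{
	(void)addr;
	so->so_bound = 1;
	return 0;
}

static const struct pr_usrreqs demo_usrreqs = { .pru_bind = demo_bind };
static const struct pr_usrreqs nobind_usrreqs = { .pru_bind = NULL };
```

Keeping the check in the wrapper means a protocol without bind support needs no dummy handler at all.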


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
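
A small C sketch of why the unsigned suffix matters. The constants below are illustrative assumptions, not the real TF_ or TCPT_ values. With a plain `int` literal the same shift would produce a value that does not fit in a 32-bit `int`, which is undefined behavior and the kind of thing a sanitizer reports.

```c
#include <stdint.h>

/* Illustrative values only; the real TF_ and TCPT_ constants may differ. */
#define DEMO_TF_TIMER_BASE	0x08000000U	/* unsigned literal */
#define DEMO_TIMER_INDEX	4

/* Returns the flag bit for a timer index; all arithmetic is unsigned. */
static uint32_t
demo_timer_flag(unsigned int idx)
{
	return DEMO_TF_TIMER_BASE << idx;
}
```

Here `demo_timer_flag(DEMO_TIMER_INDEX)` lands exactly on bit 31, which is well defined for an unsigned operand.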


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
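
The per-timer flag can be sketched as follows. This is a minimal model, assuming a hypothetical `struct demo_tcpcb` with one bit per timer; the real code works on the tcpcb flags and the timeout(9) API, neither of which is reproduced here.

```c
#include <stdbool.h>

#define DEMO_NTIMERS 4

/* Hypothetical miniature control block. */
struct demo_tcpcb {
	unsigned int armed;		/* one bit per timer */
	int fired[DEMO_NTIMERS];	/* how often each handler really ran */
};

static void
demo_timer_arm(struct demo_tcpcb *tp, int t)
{
	tp->armed |= 1U << t;
}

static void
demo_timer_cancel(struct demo_tcpcb *tp, int t)
{
	tp->armed &= ~(1U << t);
}

/*
 * Body of a timeout handler, entered after it has obtained the lock.
 * Returns true if the timer did its work.
 */
static bool
demo_timer_run(struct demo_tcpcb *tp, int t)
{
	if ((tp->armed & (1U << t)) == 0)
		return false;	/* canceled while waiting for the lock */
	tp->armed &= ~(1U << t);
	tp->fired[t]++;
	return true;
}
```

A handler that was already queued when the timer got canceled finds its bit cleared and returns without touching the connection.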


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as a soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
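
The role switch between the two caches can be sketched like this. All names (`demo_set`, `demo_syn_cache`, `demo_insert`) are hypothetical, the hash table itself is left out, and the new seed is passed in by the caller instead of being drawn from a random source.

```c
#include <stdint.h>

#define DEMO_USE_LIMIT 100000

struct demo_set {
	int count;		/* entries currently stored */
	int use;		/* insertions left before a swap is wanted */
	uint32_t seed;		/* hash seed for this set */
};

struct demo_syn_cache {
	struct demo_set set[2];
	int active;		/* index of the set that takes new entries */
};

/* Pick the set for a new entry, swapping roles first when allowed. */
static struct demo_set *
demo_insert(struct demo_syn_cache *sc, uint32_t new_seed)
{
	struct demo_set *act = &sc->set[sc->active];
	struct demo_set *pas = &sc->set[!sc->active];

	if (act->use <= 0 && pas->count == 0) {
		/* passive set has drained: reseed it and make it active */
		pas->seed = new_seed;
		pas->use = DEMO_USE_LIMIT;
		sc->active = !sc->active;
		act = pas;
	}
	act->count++;
	act->use--;
	return act;
}
```

Because new entries only ever go to the active set, the passive one drains on its own, so an attacker who keeps sending SYNs can no longer hold off the reseed.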


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
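
For readers unfamiliar with the idiom, a brief C sketch. `DEMO_RESET` and `demo_reset_if` are hypothetical names chosen for illustration. Wrapping a multi-statement macro in `do { } while (0)` makes it expand to a single statement, so it can be followed by a semicolon and sit in an `if`/`else` without braces.

```c
/* Hypothetical two-statement macro using the idiom. */
#define DEMO_RESET(a, b)	do {					\
	(a) = 0;							\
	(b) = 0;							\
} while (/* CONSTCOND */ 0)

/* Resets both values when cond is true, else marks x as untouched. */
static void
demo_reset_if(int cond, int *x, int *y)
{
	if (cond)
		DEMO_RESET(*x, *y);
	else
		*x = -1;
}
```

Written as two bare statements, the same macro would put only the first store under the `if`, and the `else` would not compile.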


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun
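
RFC 3390 sets the upper bound of the initial window to min(4*MSS, max(2*MSS, 4380 bytes)). A minimal C sketch of that formula follows; `demo_initial_window` is a hypothetical helper, not the function used in the kernel.

```c
/* Initial window in bytes per RFC 3390: min(4*MSS, max(2*MSS, 4380)). */
static unsigned int
demo_initial_window(unsigned int mss)
{
	unsigned int lo = 2 * mss > 4380 ? 2 * mss : 4380;
	unsigned int hi = 4 * mss;

	return hi < lo ? hi : lo;
}
```

For a typical Ethernet MSS of 1460 bytes this gives 4380 bytes, i.e. three full segments.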


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.
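
A port bitmask of this kind can be sketched in C as below. The layout is an assumption made for illustration (one bit per port over the full 16-bit range, 32-bit words), and `demo_badports`, `demo_port_set` and `demo_port_isset` are hypothetical names; the kernel's actual map size and macros may differ.

```c
#include <stdint.h>

#define DEMO_MAXPORT	65536
#define DEMO_MAPBITS	32
#define DEMO_MAPSIZE	(DEMO_MAXPORT / DEMO_MAPBITS)

/* One bit per port; a set bit means "do not allocate dynamically". */
struct demo_badports {
	uint32_t map[DEMO_MAPSIZE];
};

static void
demo_port_set(struct demo_badports *bp, unsigned int port)
{
	bp->map[port / DEMO_MAPBITS] |= UINT32_C(1) << (port % DEMO_MAPBITS);
}

static int
demo_port_isset(const struct demo_badports *bp, unsigned int port)
{
	return (bp->map[port / DEMO_MAPBITS] >> (port % DEMO_MAPBITS)) & 1;
}
```

The lookup is a single word load and shift, which keeps the check cheap in the port allocation path.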


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.152 29-Aug-2022 mvs

Move PRU_RCVOOB request to (*pru_rcvoob)().

ok bluhm@


# 1.151 28-Aug-2022 mvs

Move PRU_SENSE request to (*pru_sense)().

ok bluhm@


# 1.150 28-Aug-2022 mvs

Move PRU_ABORT request to (*pru_abort)().

We abort only the sockets which are linked to `so_q' or `so_q0' queues of
listening socket. Such sockets have no corresponding file descriptor and
are not accessed from userland, so PRU_ABORT used to destroy them on
listening socket destruction.

Currently all our sockets support the PRU_ABORT request, but it is actually
required only for tcp(4) and unix(4) sockets, so it should be optional.
However, they will be removed with a separate diff, and this time PRU_ABORT
requests were converted as is.

Also, the socket should be destroyed on a PRU_ABORT request, but route and
key management sockets leave it alive. This was also converted as is,
because this wrong code is never called.

ok bluhm@


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had a `control' mbuf(9)
leak. It was fixed in the new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect)().

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For protocols that don't support the request, leave the handler NULL. Do the
NULL check within the corresponding pru_() wrapper and return EOPNOTSUPP in
such a case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
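
The NULL-handler convention described in 1.142 can be sketched roughly as
follows. This is only an illustration under simplifying assumptions, not the
kernel code: struct usrreqs_sketch, pru_bind_sketch() and demo_bind() are
hypothetical names, and the real handlers take socket, mbuf and proc
arguments rather than a plain integer.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-protocol handler table. */
struct usrreqs_sketch {
	int (*pru_bind)(int sockid);	/* NULL if the protocol has no bind */
};

/* Wrapper: do the NULL check once, so callers never test the pointer. */
static int
pru_bind_sketch(const struct usrreqs_sketch *ur, int sockid)
{
	if (ur->pru_bind == NULL)
		return (EOPNOTSUPP);
	return (ur->pru_bind(sockid));
}

/* Example handler for a protocol that does support the request. */
static int
demo_bind(int sockid)
{
	return (sockid >= 0 ? 0 : EINVAL);
}
```

Keeping the check in the wrapper means a protocol opts out simply by leaving
the table slot empty, and every caller gets the same error for it.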


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the RTT of a connection, but this also provides
a lot more specialized info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time, the more specific
data is only populated for privileged processes. This is done to avoid sharing
data with userland that could be used to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
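
Why the unsigned suffix matters can be shown with a small sketch. The
constants below are assumptions chosen for illustration (a timer flag base
at bit 26 and a delayed ACK timer index of 5); they are not copied from
tcp_var.h, and timer_flag_sketch() is a hypothetical helper.

```c
#include <stdint.h>

/* Assumed values for illustration only. */
#define TF_TIMER_SKETCH		0x04000000U	/* unsigned flag base, bit 26 */
#define TCPT_DELACK_SKETCH	5		/* assumed timer index */

/*
 * With an unsigned base the shift is well defined even when the result
 * lands on bit 31.  The same shift on a signed int constant (no U suffix)
 * would overflow into the sign bit, which is undefined behaviour in C.
 */
static uint32_t
timer_flag_sketch(unsigned int timer)
{
	return ((uint32_t)(TF_TIMER_SKETCH << timer));
}
```

Under these assumed values, shifting the base by the delayed ACK index yields
exactly the top bit of a 32-bit word, which is the case an undefined
behaviour sanitizer would flag for a signed constant.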


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
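
The per-timer flag from 1.130 can be sketched as below. This is a simplified
model under stated assumptions, not the kernel implementation: struct
conn_sketch and the timer_*() helpers are hypothetical, and locking is left
out entirely.

```c
#include <stdbool.h>

struct conn_sketch {
	unsigned int armed;	/* one bit per timer: set on arm, cleared on cancel */
	int fired;		/* counts handler bodies that really ran */
};

static void
timer_arm(struct conn_sketch *c, int t)
{
	c->armed |= 1U << t;
}

static void
timer_cancel(struct conn_sketch *c, int t)
{
	c->armed &= ~(1U << t);
}

/*
 * Timer callback.  It may start late, e.g. after sleeping on a lock, so it
 * rechecks the flag and does nothing if the timer was canceled meanwhile.
 */
static bool
timer_handler(struct conn_sketch *c, int t)
{
	if ((c->armed & (1U << t)) == 0)
		return (false);
	c->armed &= ~(1U << t);
	c->fired++;
	return (true);
}
```

The point of the flag is that cancellation no longer depends on removing the
pending timeout in time; a late callback simply finds its bit cleared.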


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as a soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows simplifying code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
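
The active/passive scheme from 1.111 can be sketched as follows. This is a
toy model under stated assumptions: the struct and function names are
hypothetical, rand() stands in for a proper random source, and entry removal
and the hash table itself are omitted.

```c
#include <stdlib.h>

struct syn_set_sketch {
	int count;		/* entries currently stored */
	int use;		/* insertions left before a switch is due */
	unsigned int seed;	/* hash seed for this set */
};

struct syn_cache_sketch {
	struct syn_set_sketch set[2];
	int active;		/* index of the set receiving new entries */
	int use_limit;		/* 100000 in the commit message */
};

/* Insert one entry; swap roles and reseed first when both conditions hold. */
static void
syn_insert_sketch(struct syn_cache_sketch *sc)
{
	struct syn_set_sketch *act = &sc->set[sc->active];
	struct syn_set_sketch *pas = &sc->set[!sc->active];

	if (pas->count == 0 && act->use <= 0) {
		/* rand() stands in for a real random source. */
		pas->seed = (unsigned int)rand();
		pas->use = sc->use_limit;
		sc->active = !sc->active;
		act = pas;
	}
	act->count++;
	act->use--;
}
```

Because new entries only ever go to the active set, an attacker who keeps
sending SYNs cannot stop the passive set from draining, so the reseed
eventually happens.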


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows writing relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows running isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
an alternate routing table and separating them from other interfaces in
distinct routing tables. The same network can now be used in any domain at
the same time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
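
The idiom named in 1.90 can be illustrated with a short sketch. The macros
and the function below are hypothetical examples written for this note; they
are not taken from tcp_var.h.

```c
/* Fragile: expands to two statements, so only the first obeys an `if'. */
#define RESET_BAD(a, b)		(a) = 0; (b) = 0

/* Robust: behaves as a single statement and still takes a trailing `;'. */
#define RESET_GOOD(a, b) do {						\
	(a) = 0;							\
	(b) = 0;							\
} while (/* CONSTCOND */ 0)

/* Returns x + y after a guarded reset; y must survive when cond is false. */
static int
guarded_reset_sketch(int cond)
{
	int x = 1, y = 2;

	if (cond)
		RESET_GOOD(x, y);
	else
		x = 5;
	return (x + y);
}
```

Substituting RESET_BAD in the same if/else would not even compile, since the
expansion ends the if statement before the else is reached.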


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups; the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.149 27-Aug-2022 mvs

Move PRU_SEND request to (*pru_send)().

The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9)
leak. It was fixed in new gre_send().

The former pfkeyv2_send() was renamed to pfkeyv2_dosend().

ok bluhm@


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect)().

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
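
The NULL-handler convention described in this entry can be sketched as
follows. This is a minimal illustration, not the kernel code: the names
ex_usrreqs, ex_tcp_bind and ex_pru_bind, and the simplified int argument,
are assumptions made for the example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-protocol handler table. */
struct ex_usrreqs {
	int (*pru_bind)(int port);	/* NULL when bind is unsupported */
};

/* Example handler for a protocol that does support bind. */
static int
ex_tcp_bind(int port)
{
	return (port > 0 && port < 65536) ? 0 : EINVAL;
}

/* Wrapper: check for a missing handler before dispatching to it. */
static int
ex_pru_bind(const struct ex_usrreqs *ur, int port)
{
	if (ur->pru_bind == NULL)
		return EOPNOTSUPP;
	return (*ur->pru_bind)(port);
}
```

Keeping the check in the wrapper means callers never test the pointer
themselves, and protocols without the request simply leave the slot empty.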


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
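
A small sketch of why the unsigned suffix matters. The constant values
below are illustrative assumptions, not copied from tcp_var.h: with a
signed constant, shifting a set bit into bit 31 is undefined behaviour in
C, while the unsigned form is well defined.

```c
/* Illustrative values (assumptions, not taken from the header). */
#define EX_TF_TIMER	0x04000000U	/* base bit for per-timer flags */
#define EX_TCPT_DELACK	5		/* index of the delayed ACK timer */

/* Return the flag bit that marks timer `idx` as armed. */
static unsigned int
ex_timer_flag(unsigned int idx)
{
	/* Unsigned shift: reaching bit 31 is well defined. */
	return EX_TF_TIMER << idx;
}
```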


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
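
The per-timer flag idea can be sketched like this. It is a simplified
model under stated assumptions: ex_tcpcb, the flag value and the function
names are hypothetical, and locking is left out entirely.

```c
#define EX_TF_TIMER	0x04000000U	/* assumed base bit for timer flags */

struct ex_tcpcb {
	unsigned int t_flags;	/* one "armed" bit per timer */
	int fired;		/* how many handlers actually ran */
};

static void
ex_timer_arm(struct ex_tcpcb *tp, unsigned int idx)
{
	tp->t_flags |= EX_TF_TIMER << idx;
}

static void
ex_timer_cancel(struct ex_tcpcb *tp, unsigned int idx)
{
	tp->t_flags &= ~(EX_TF_TIMER << idx);
}

/*
 * Timer handler body.  It may start late, after sleeping for a lock,
 * so it rechecks the flag instead of trusting that it is still armed.
 */
static int
ex_timer_run(struct ex_tcpcb *tp, unsigned int idx)
{
	if ((tp->t_flags & (EX_TF_TIMER << idx)) == 0)
		return 0;	/* canceled in the meantime: do nothing */
	tp->t_flags &= ~(EX_TF_TIMER << idx);
	tp->fired++;
	return 1;
}
```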


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as a soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
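
The active/passive scheme from this entry can be sketched as below. This
is only a model under assumptions: the structure and function names are
hypothetical, the new seed is passed in by the caller instead of being
drawn from a random source, and entry removal is not shown.

```c
#define EX_SYN_USE_LIMIT 100000	/* uses of the active set before a switch */

struct ex_syn_set {
	int count;		/* entries currently held in this set */
	int use;		/* insertions since the last reseed */
	unsigned int seed;	/* hash seed for this set */
};

struct ex_syn_cache {
	struct ex_syn_set set[2];
	int active;		/* index of the set receiving new entries */
};

/*
 * Account for one new entry.  If the active set is used up and the
 * passive set has drained, swap roles and reseed the new active set.
 * Returns the index of the set that received the entry.
 */
static int
ex_syn_insert(struct ex_syn_cache *sc, unsigned int new_seed)
{
	struct ex_syn_set *act = &sc->set[sc->active];
	struct ex_syn_set *pas = &sc->set[!sc->active];

	if (act->use >= EX_SYN_USE_LIMIT && pas->count == 0) {
		sc->active = !sc->active;
		act = pas;
		act->seed = new_seed;
		act->use = 0;
	}
	act->count++;
	act->use++;
	return sc->active;
}
```

Because the passive set only idles out, an attacker who keeps the active
set non-empty can no longer block the reseed indefinitely.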


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Test by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
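
For readers unfamiliar with the idiom, here is a minimal sketch. The macro
EX_CLAMP and the helper function are hypothetical examples, not macros
from tcp_var.h. Wrapping the body in do { } while (0) makes the macro
behave as one statement, so it stays correct in an unbraced if/else.

```c
/* Hypothetical multi-statement macro: clamp *(p) into [lo, hi]. */
#define EX_CLAMP(p, lo, hi) do {					\
	if (*(p) < (lo))						\
		*(p) = (lo);						\
	else if (*(p) > (hi))						\
		*(p) = (hi);						\
} while (/* CONSTCOND */ 0)

static int
ex_clamp_if_enabled(int v, int enabled)
{
	/* Without the do/while wrapper this else would not parse as meant. */
	if (enabled)
		EX_CLAMP(&v, 0, 10);
	else
		v = -1;
	return v;
}
```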


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.148 26-Aug-2022 mvs

Move PRU_RCVD request to (*pru_rcvd)().

ok bluhm@


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot more specialized info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow an attack on a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delayed ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.147 22-Aug-2022 mvs

Move PRU_SHUTDOWN request to (*pru_shutdown)().

ok bluhm@


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delayed ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
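
A minimal sketch of the per-timer armed flag described above, modeled in Python rather than kernel C. The `TF_TIMER` value, the helper names and the `Tcpcb` class are assumptions made for illustration only.

```python
TF_TIMER = 0x04000000  # assumed base bit; one flag bit per timer index


class Tcpcb:
    """Hypothetical control block holding only what the model needs."""

    def __init__(self):
        self.t_flags = 0
        self.fired = []


def timer_arm(tp, timer):
    tp.t_flags |= TF_TIMER << timer


def timer_cancel(tp, timer):
    tp.t_flags &= ~(TF_TIMER << timer)


def timer_isarmed(tp, timer):
    return bool(tp.t_flags & (TF_TIMER << timer))


def timer_handler(tp, timer):
    """Runs after the lock is taken; the timer may have been canceled meanwhile."""
    if not timer_isarmed(tp, timer):
        return False  # canceled while the handler was waiting: do nothing
    timer_cancel(tp, timer)
    tp.fired.append(timer)
    return True
```

The point of the flag is that the handler itself decides whether it is still wanted, so a handler that was already scheduled when the timer was disarmed becomes a no-op.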


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as a soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
packets not SACKed up to the most forward SACK are lost. It may be a
correct estimate if the network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows simplifying code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent with the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
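
The active/passive switching described above can be sketched as a small model. It is written in Python rather than kernel C; the class names, the dict standing in for hash buckets, and the exact way uses are counted are assumptions for illustration.

```python
import os


class SynCacheSet:
    """One of the two cache sets; the dict stands in for the hash buckets."""

    def __init__(self):
        self.entries = {}
        self.seed = os.urandom(16)  # models the hash seed
        self.uses = 0


class SynCache:
    def __init__(self, use_limit=100000):
        self.active = SynCacheSet()
        self.passive = SynCacheSet()
        self.use_limit = use_limit
        self.reseeds = 0

    def insert(self, key, value):
        # Switch roles only once the passive set has drained and the
        # active set has been used often enough; then reseed.
        if not self.passive.entries and self.active.uses >= self.use_limit:
            self.active, self.passive = self.passive, self.active
            self.active.seed = os.urandom(16)
            self.active.uses = 0
            self.reseeds += 1
        self.active.entries[key] = value
        self.active.uses += 1

    def lookup(self, key):
        # Existing entries may live in either set.
        for cache_set in (self.active, self.passive):
            if key in cache_set.entries:
                return cache_set.entries[key]
        return None

    def remove(self, key):
        for cache_set in (self.active, self.passive):
            if key in cache_set.entries:
                del cache_set.entries[key]
                return True
        return False
```

Because new entries only ever go into the active set, an attacker who keeps the cache non-empty can no longer stop the passive set from draining, so a reseed eventually happens.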


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows writing relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows running isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pick up the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing tables and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be of an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.146 22-Aug-2022 mvs

Move PRU_DISCONNECT request to (*pru_disconnect).

ok bluhm@


# 1.145 22-Aug-2022 mvs

Move PRU_ACCEPT request to (*pru_accept)().

ok bluhm@


# 1.144 21-Aug-2022 mvs

Move PRU_CONNECT request to (*pru_connect)() handler.

ok bluhm@


# 1.143 21-Aug-2022 mvs

Move PRU_LISTEN request to (*pru_listen)() handler.

ok bluhm@


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok derradt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delayed ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
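
The per-timer flag can be sketched as follows. All names here (ex_tcpcb,
ex_timer_arm and so on) are hypothetical, and the real code also deals with
the timeout API and the net lock; the sketch only shows why a handler that
runs late must re-check the flag.

```c
#include <stdbool.h>

#define EX_NTIMERS	4	/* hypothetical number of timers */

struct ex_tcpcb {
	unsigned int armed;		/* one bit per timer, set while armed */
	int fired[EX_NTIMERS];		/* how often each handler body ran */
};

static void
ex_timer_arm(struct ex_tcpcb *tp, int t)
{
	tp->armed |= 1U << t;
}

static void
ex_timer_cancel(struct ex_tcpcb *tp, int t)
{
	tp->armed &= ~(1U << t);
}

/*
 * Timeout handler. It may run after the timer was canceled, for example
 * because it slept while waiting for a lock, so it checks the flag first
 * and does nothing if the timer is no longer armed.
 */
static bool
ex_timer_handler(struct ex_tcpcb *tp, int t)
{
	if ((tp->armed & (1U << t)) == 0)
		return false;		/* canceled in the meantime */
	tp->armed &= ~(1U << t);
	tp->fired[t]++;
	return true;
}
```

With this check, a cancel that races with an already scheduled handler is
harmless: the handler sees the cleared bit and returns.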


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as a soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that, when something is lost, assumes that all
packets not yet SACKed up to the most forward SACK are lost. This may be
a correct estimate if the network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows simplifying code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow adjusting tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent with the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
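
The role switch between the two caches can be sketched as follows. The
structure and function names are hypothetical, the use counter is modeled
as a countdown, and the new seed is passed in by the caller instead of
being drawn from a kernel random source.

```c
#include <stdint.h>

#define EX_USE_LIMIT	100000	/* uses before a switch, per the log entry */

struct ex_syn_set {
	int count;		/* entries currently stored in this set */
	int use;		/* remaining uses before a reseed is wanted */
	uint32_t seed;		/* hash seed of this set */
};

struct ex_syn_cache {
	struct ex_syn_set set[2];
	int active;		/* index of the set that gets new entries */
};

/*
 * Pick the set for a new entry. When the active set is used up and the
 * passive set has idled out, reseed the empty set and make it active.
 * The old active set keeps its entries and drains on its own.
 */
static struct ex_syn_set *
ex_syn_cache_pick(struct ex_syn_cache *sc, uint32_t new_seed)
{
	struct ex_syn_set *act = &sc->set[sc->active];
	struct ex_syn_set *pas = &sc->set[!sc->active];

	if (act->use <= 0 && pas->count == 0) {
		pas->seed = new_seed;
		pas->use = EX_USE_LIMIT;
		sc->active = !sc->active;
		act = pas;
	}
	act->use--;
	act->count++;
	return act;
}
```

Because the reseed only ever touches an empty set, an attacker who keeps
entries alive in one set can delay the switch but cannot pin the seed of
the set that receives new entries after it.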


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows writing relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similarly
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows running isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pick up the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to look up the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
an alternate routing table and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
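
A short sketch of why the idiom matters. EX_SWAP and ex_sort2 are
hypothetical names for illustration and do not appear in tcp_var.h.

```c
/*
 * Hypothetical multi-statement macro. Wrapping the body in
 * do { } while (0) makes the expansion a single statement that
 * still requires the trailing semicolon written by the caller.
 */
#define EX_SWAP(a, b) do {						\
	int ex_tmp = (a);						\
	(a) = (b);							\
	(b) = ex_tmp;							\
} while (/* CONSTCOND */ 0)

/*
 * Put two values in ascending order. Returns 1 if they were already
 * ordered, 0 if they had to be swapped.
 */
static int
ex_sort2(int *lo, int *hi)
{
	int untouched = 0;

	if (*lo > *hi)
		EX_SWAP(*lo, *hi);	/* one statement, so else still binds */
	else
		untouched = 1;
	return untouched;
}
```

Had the macro been a bare { } block, the semicolon after EX_SWAP(...) would
end the if statement and the following else would not compile.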


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
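
The fixed-point representation mentioned in item 4 can be illustrated with
a small sketch. The 5 fraction bits come from the log entry; the smoothing
gain of 1/8 and all names are assumptions, and the kernel's actual update
in tcp_xmit_timer() is more involved.

```c
#define EX_SRTT_FRAC_BITS	5	/* fraction bits of the smoothed RTT */
#define EX_SRTT_GAIN_SHIFT	3	/* assumed smoothing gain of 1/8 */

/*
 * Update a smoothed RTT kept in fixed point (ticks << 5) with a new
 * sample given in whole ticks: srtt += (sample - srtt) / 8.
 */
static int
ex_srtt_update(int srtt_fixed, int rtt_ticks)
{
	int delta = (rtt_ticks << EX_SRTT_FRAC_BITS) - srtt_fixed;

	return srtt_fixed + delta / (1 << EX_SRTT_GAIN_SHIFT);
}

/* Convert the fixed-point value back to whole ticks. */
static int
ex_srtt_ticks(int srtt_fixed)
{
	return srtt_fixed >> EX_SRTT_FRAC_BITS;
}
```

Forgetting the shift when mixing scaled and unscaled values is the kind of
mistake that item 3 of the entry fixes.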


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.142 20-Aug-2022 mvs

Move PRU_BIND request to (*pru_bind)() handler.

For the protocols which don't support request, leave handler NULL. Do the
NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in
such case. This will be done for all upcoming user request handlers.

ok bluhm@ guenther@
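
The wrapper pattern described above can be sketched like this. The types
and names (ex_socket, ex_usrreqs, ex_pru_bind) are simplified stand-ins,
not the kernel's real declarations.

```c
#include <errno.h>
#include <stddef.h>

struct ex_socket;	/* opaque stand-in for the kernel's socket */

/* Simplified handler table with a single request handler. */
struct ex_usrreqs {
	int (*pru_bind)(struct ex_socket *, const void *);
};

/*
 * Wrapper used by callers. A protocol that does not support the
 * request leaves the handler NULL and the wrapper reports EOPNOTSUPP.
 */
static int
ex_pru_bind(const struct ex_usrreqs *ur, struct ex_socket *so,
    const void *addr)
{
	if (ur->pru_bind == NULL)
		return EOPNOTSUPP;
	return (*ur->pru_bind)(so, addr);
}

/* Example handler of a protocol that supports bind. */
static int
ex_tcp_bind(struct ex_socket *so, const void *addr)
{
	(void)so;
	(void)addr;
	return 0;
}
```

Keeping the NULL check in one wrapper means individual protocols do not
need stub functions for requests they do not implement.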


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the RTT of a connection, but this also provides
much more specialized info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time, the more specific
data is only populated for privileged processes. This avoids handing
userland data that could help attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok derradt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delayed ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
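
The per-timer flag idea can be sketched roughly as follows. All names here
(struct conn, timer_arm, timer_cancel, timer_handler, TF_TMR_BASE) are
hypothetical stand-ins, not the kernel's identifiers; the point is only that
the handler rechecks the flag after it has obtained the lock.

```c
#include <stdbool.h>

#define TF_TMR_BASE	0x04000000U	/* hypothetical: flag of timer 0 */

struct conn {				/* hypothetical stand-in for the tcpcb */
	unsigned int	flags;		/* one "armed" bit per timer */
	int		fired;		/* handlers that did real work */
};

static void
timer_arm(struct conn *c, unsigned int timer)
{
	c->flags |= TF_TMR_BASE << timer;
	/* a real kernel would also schedule the timeout here */
}

static void
timer_cancel(struct conn *c, unsigned int timer)
{
	/*
	 * Deleting the timeout alone is not enough: the handler may
	 * already be sleeping on the lock.  Clearing the flag tells it
	 * to do nothing once it wakes up.
	 */
	c->flags &= ~(TF_TMR_BASE << timer);
}

/* Assumed to run after the lock has been taken. */
static bool
timer_handler(struct conn *c, unsigned int timer)
{
	if ((c->flags & (TF_TMR_BASE << timer)) == 0)
		return false;		/* canceled while we slept */
	c->flags &= ~(TF_TMR_BASE << timer);
	c->fired++;
	return true;
}
```

In this sketch a handler that runs after timer_cancel() returns false and
leaves the connection untouched.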


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
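
A rough sketch of the active/passive scheme, under stated assumptions: the
struct, the function name and the way the seed is passed in are all
hypothetical, and only the use limit of 100000 comes from the message above.
New entries always go into the active set; the roles switch only when the
active set has used up its budget and the passive set has drained.

```c
#include <stdint.h>

#define SYN_USE_LIMIT	100000		/* uses before a reseed is due */

struct syn_set {			/* hypothetical, not the kernel struct */
	int		count;		/* entries currently stored */
	int		use;		/* remaining uses before reseed */
	uint32_t	seed;		/* seed of the hash function */
};

static struct syn_set sets[2];
static int active;			/* index of the active set */

/*
 * Pick the set for a new entry.  new_seed stands in for fresh random
 * data and is only consumed when the two sets switch roles.
 */
static struct syn_set *
syn_set_for_insert(uint32_t new_seed)
{
	struct syn_set *act = &sets[active], *pas = &sets[!active];

	if (act->use <= 0 && pas->count == 0) {
		/* passive set is empty: reseed it and make it active */
		pas->seed = new_seed;
		pas->use = SYN_USE_LIMIT;
		active = !active;
		act = pas;
	}
	act->use--;
	act->count++;
	return act;
}
```

Because the old active set only idles out after the switch, an attacker who
keeps it populated can no longer block the reseed of the other set.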


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Test by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
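
For readers unfamiliar with the idiom, a small sketch with a hypothetical
macro (COUNT_AND_CLEAR and drain are made up for illustration). Wrapping a
multi-statement macro in do { } while (0) turns it into a single statement,
so it can sit in an unbraced if/else and still take a trailing semicolon.

```c
/* Hypothetical macro, for illustration only. */
#define COUNT_AND_CLEAR(cnt, val) do {					\
	(cnt) += (val);							\
	(val) = 0;							\
} while (/* CONSTCOND */ 0)

static int
drain(int *pending, int enable)
{
	int total = 0;

	/* Without the do/while wrapper this if/else would not parse. */
	if (enable)
		COUNT_AND_CLEAR(total, *pending);
	else
		total = -1;
	return total;
}
```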


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.141 15-Aug-2022 mvs

Introduce 'pr_usrreqs' structure and move existing user-protocol
handlers into it. We want to split existing (*pr_usrreq)() to multiple
short handlers for each PRU_ request as it was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)()
split will be done with the following diffs.

Based on reverted diff from guenther@.

ok bluhm@


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot of more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok derradt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoid traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.140 11-Aug-2022 claudio

Add TCP_INFO support to getsockopt for tcp sessions.

TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection but this also provides
a lot more special info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as a soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
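
The role switch described in revision 1.111 can be sketched roughly as follows. This is a minimal illustration and not the kernel code: the names `syn_set`, `syn_cache_pair`, `syn_set_maybe_swap()`, `syn_set_insert()` and `SYN_USE_LIMIT` are hypothetical, and `rand()` is only a placeholder for a proper random source. Only the use limit of 100000 is taken from the log entry.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names; only the 100000 use limit comes from the log entry. */
#define SYN_USE_LIMIT	100000

struct syn_set {
	uint32_t seed;	/* hash seed for this set */
	int	 count;	/* entries currently stored */
	int	 uses;	/* insertions left before a switch is wanted */
};

struct syn_cache_pair {
	struct syn_set	sets[2];
	int		active;	/* index of the set receiving new entries */
};

/*
 * Call before inserting a new entry.  The sets switch roles only when
 * the active set has used up its budget and the passive set is empty.
 * Returns 1 if a switch happened, 0 otherwise.
 */
int
syn_set_maybe_swap(struct syn_cache_pair *p)
{
	struct syn_set *act = &p->sets[p->active];
	struct syn_set *pas = &p->sets[!p->active];

	if (act->uses > 0 || pas->count != 0)
		return 0;
	/* The passive set has drained: reseed it and make it active. */
	pas->seed = (uint32_t)rand();	/* placeholder for a real RNG */
	pas->uses = SYN_USE_LIMIT;
	p->active = !p->active;
	return 1;
}

/* Account for one insertion into the active set. */
void
syn_set_insert(struct syn_cache_pair *p)
{
	struct syn_set *act = &p->sets[p->active];

	act->count++;
	if (act->uses > 0)
		act->uses--;
}
```

Because the old active set stops receiving entries after the switch, an attacker who keeps sending SYNs can no longer keep it from draining, which is the point of the change.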


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
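
For readers unfamiliar with the idiom named in revision 1.90, here is a small sketch of why a multi-statement macro is wrapped in `do { } while (0)`. The macro `SWAP_INT` and the function `order_pair()` are made up for this example and do not appear in tcp_var.h.

```c
/*
 * Hypothetical example macro.  The do/while wrapper turns the body into
 * a single statement, so the macro can be followed by a semicolon and
 * used safely as the body of an if/else without extra braces.
 */
#define SWAP_INT(a, b) do {						\
	int swap_tmp_ = (a);						\
	(a) = (b);							\
	(b) = swap_tmp_;						\
} while (/* CONSTCOND */ 0)

/* Ensure *lo <= *hi on return. */
void
order_pair(int *lo, int *hi)
{
	if (*lo > *hi)
		SWAP_INT(*lo, *hi);
	else
		(void)0;	/* the else still pairs with the if above */
}
```

Without the wrapper, a bare `{ ... }` block followed by `;` would end the `if` statement and leave the `else` dangling, which is a compile error.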


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
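
As background for the SEQ_MIN/SEQ_MAX item in revision 1.71, the following sketch shows how minimum and maximum can be computed on 32-bit sequence numbers that wrap around. The `ex_` names are hypothetical and the code is an illustration of the general technique, not a copy of the kernel macros.

```c
#include <stdint.h>

typedef uint32_t ex_seq;	/* hypothetical sequence number type */

/*
 * a is "before" b if the wrapped difference, viewed as a signed 32-bit
 * value, is negative.  This is only meaningful while the two values are
 * less than 2^31 apart.  The unsigned-to-signed conversion relies on the
 * usual two's complement behaviour.
 */
int
ex_seq_lt(ex_seq a, ex_seq b)
{
	return (int32_t)(a - b) < 0;
}

/* Earlier of two sequence numbers, wraparound aware. */
ex_seq
ex_seq_min(ex_seq a, ex_seq b)
{
	return ex_seq_lt(a, b) ? a : b;
}

/* Later of two sequence numbers, wraparound aware. */
ex_seq
ex_seq_max(ex_seq a, ex_seq b)
{
	return ex_seq_lt(a, b) ? b : a;
}
```

A plain numeric comparison would get the order wrong for a pair such as 0xfffffff0 and 0x10, where the second value is the later one after the counter wraps.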


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.
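
The bitmask representation mentioned in revision 1.8 can be pictured with the sketch below. All names (`ex_portmap`, `ex_port_set()`, `ex_port_isset()`) and the choice to track only ports below 1024 are assumptions made for illustration; the log entry does not state the map size.

```c
#include <stdint.h>

#define EX_NPORTS	1024	/* assumption: only low ports are tracked */
#define EX_MAPBITS	32
#define EX_MAPSIZE	(EX_NPORTS / EX_MAPBITS)

/* One bit per port; a set bit means "do not allocate dynamically". */
struct ex_portmap {
	uint32_t bits[EX_MAPSIZE];
};

/* Mark a port as excluded.  Ports outside the map are ignored. */
void
ex_port_set(struct ex_portmap *m, unsigned int port)
{
	if (port < EX_NPORTS)
		m->bits[port / EX_MAPBITS] |=
		    UINT32_C(1) << (port % EX_MAPBITS);
}

/* Return 1 if the port is excluded, 0 otherwise. */
int
ex_port_isset(const struct ex_portmap *m, unsigned int port)
{
	if (port >= EX_NPORTS)
		return 0;
	return (m->bits[port / EX_MAPBITS] >> (port % EX_MAPBITS)) & 1;
}
```

A port allocator would call `ex_port_isset()` on each candidate and skip those that return 1. Compared with a list of port numbers, the lookup is a constant-time bit test.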


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.139 25-Feb-2022 guenther

Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for followup issues for a bit


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
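
The problem fixed in revision 1.137 can be illustrated with a short sketch. The constant value and the names `EX_TF_TIMER` and `ex_timer_flag()` are chosen for the example and are not copied from tcp_var.h.

```c
/* Illustrative value; the U suffix makes the constant unsigned int. */
#define EX_TF_TIMER	0x04000000U

/*
 * Return the flag bit for timer index idx, where 0 <= idx <= 5.
 *
 * If the constant were the signed int 0x04000000, shifting it left by 5
 * would yield 0x80000000, which is not representable in a 32-bit int;
 * that is undefined behaviour in C.  With an unsigned operand the shift
 * is well defined and simply sets the top bit.
 */
unsigned int
ex_timer_flag(int idx)
{
	return EX_TF_TIMER << idx;
}
```

Since the result is stored in an unsigned flags word anyway, defining the constants as unsigned costs nothing and keeps sanitizers quiet.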


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok derradt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
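
The per-timer flag from revision 1.130 might look roughly like the sketch below. Every name here (`ex_tcpcb`, `ex_timer_arm()`, `ex_timer_cancel()`, `ex_timer_fire()`) is hypothetical, and locking is left out entirely. The idea shown is only that a handler which was already scheduled checks its "armed" bit and does nothing if the timer was canceled in the meantime.

```c
#define EX_TF_TIMER	0x04000000U	/* base bit of the per-timer flags */

enum ex_timer { EX_T_REXMT, EX_T_PERSIST, EX_T_KEEP, EX_T_NTIMERS };

struct ex_tcpcb {
	unsigned int	t_flags;	/* one "armed" bit per timer */
	int		t_fired;	/* handler bodies that actually ran */
};

/* Mark the timer as armed (scheduling the timeout is not shown). */
void
ex_timer_arm(struct ex_tcpcb *tp, enum ex_timer t)
{
	tp->t_flags |= EX_TF_TIMER << t;
}

/* Mark the timer as canceled, even if its handler is already queued. */
void
ex_timer_cancel(struct ex_tcpcb *tp, enum ex_timer t)
{
	tp->t_flags &= ~(EX_TF_TIMER << t);
}

/*
 * Timeout handler entry point.  Returns 1 if the timer work ran, 0 if
 * the timer had been canceled after the handler was scheduled.
 */
int
ex_timer_fire(struct ex_tcpcb *tp, enum ex_timer t)
{
	unsigned int flag = EX_TF_TIMER << t;

	if ((tp->t_flags & flag) == 0)
		return 0;
	tp->t_flags &= ~flag;
	tp->t_fired++;		/* stands in for the real timer work */
	return 1;
}
```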


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have fewer different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
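
For readers unfamiliar with the idiom, here is a small sketch of why a multi-statement macro is wrapped this way. SKETCH_SWAP and sketch_order are hypothetical names for illustration and do not appear in tcp_var.h.

```c
/*
 * Hypothetical multi-statement macro. The do { } while (0) wrapper turns
 * the body into a single statement that still needs a trailing semicolon.
 */
#define SKETCH_SWAP(a, b) do {		\
	int sketch_tmp = (a);		\
	(a) = (b);			\
	(b) = sketch_tmp;		\
} while (/* CONSTCOND */ 0)

/* Put *lo and *hi in ascending order; return 1 if they were swapped. */
int
sketch_order(int *lo, int *hi)
{
	/*
	 * With a bare { } block instead of do/while, the semicolon after
	 * the macro call would end the if statement and orphan the else.
	 */
	if (*lo > *hi)
		SKETCH_SWAP(*lo, *hi);
	else
		return 0;
	return 1;
}
```

The CONSTCOND comment is a lint annotation marking the constant condition as intentional.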


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be of an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
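
Item 4 refers to fixed-point storage of the smoothed RTT and its variance. A minimal sketch of such scaling, assuming non-negative values; the sketch_* names and macros are hypothetical and only mirror the bit counts given in the entry, not the kernel's actual macros.

```c
#define SKETCH_SRTT_SHIFT	5	/* fraction bits of the smoothed RTT */
#define SKETCH_RTTVAR_SHIFT	4	/* fraction bits of the RTT variance */

/* Scale a whole number of ticks into the fixed-point representation. */
int
sketch_to_fixed(int ticks, int shift)
{
	return ticks << shift;
}

/* Whole ticks contained in a fixed-point value; this is the "missing shift". */
int
sketch_whole(int fixed, int shift)
{
	return fixed >> shift;
}

/* Fractional part, in units of 1 / 2^shift ticks. */
int
sketch_frac(int fixed, int shift)
{
	return fixed & ((1 << shift) - 1);
}
```

Using a stored value without the shift would overstate it by a factor of 32 for the smoothed RTT or 16 for the variance, which is the kind of error item 3 addresses.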


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoid traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.
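
A per-port bitmask of this kind can be sketched as below. This is an illustration under assumptions: the sketch_* names are hypothetical, and the kernel's own macros and storage layout may differ.

```c
#include <stdint.h>

#define SKETCH_NPORTS	65536	/* one bit per possible port number */
#define SKETCH_WORDBITS	32	/* bits per array element */

/* Zero-initialized: no port is excluded until explicitly marked. */
uint32_t sketch_mask[SKETCH_NPORTS / SKETCH_WORDBITS];

/* Mark a port as not to be allocated dynamically. */
void
sketch_port_set(uint16_t port)
{
	sketch_mask[port / SKETCH_WORDBITS] |=
	    (uint32_t)1 << (port % SKETCH_WORDBITS);
}

/* Allow dynamic allocation of a port again. */
void
sketch_port_clr(uint16_t port)
{
	sketch_mask[port / SKETCH_WORDBITS] &=
	    ~((uint32_t)1 << (port % SKETCH_WORDBITS));
}

/* Return 1 if the port is excluded from dynamic allocation, else 0. */
int
sketch_port_isset(uint16_t port)
{
	return (sketch_mask[port / SKETCH_WORDBITS] >>
	    (port % SKETCH_WORDBITS)) & 1;
}
```

The whole table takes 8 KB and a lookup is a single shift and mask, which suits a check made on every dynamic port allocation.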


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.138 25-Feb-2022 guenther

Move pr_attach and pr_detach to a new structure pr_usrreqs that can
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.

Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.

ok mvs@ bluhm@


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
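
The problem can be illustrated with a short sketch. The numeric values below are assumptions chosen so that the shift reaches bit 31; they are not quoted from tcp_var.h, and the sketch presumes a 32-bit unsigned int.

```c
/* Assumed values for illustration only. */
#define SKETCH_TF_TIMER		0x04000000U	/* flag bit of the first timer */
#define SKETCH_TCPT_DELACK	5		/* index of the delayed ACK timer */

/*
 * Flag bit for a given timer index. Because the constant carries the U
 * suffix, the shift is done in unsigned arithmetic and is well defined
 * even when the result lands on bit 31. With a plain int constant,
 * 0x04000000 << 5 would overflow a 32-bit int, which is undefined
 * behaviour in C.
 */
unsigned int
sketch_timer_flag(unsigned int timer)
{
	return SKETCH_TF_TIMER << timer;
}
```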


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
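
The per-timer flag can be sketched as follows. This is a simplified model with hypothetical sketch_* names; the real handlers also take the net lock and do the actual timer work.

```c
#define SKETCH_TF_TIMER	0x1U	/* hypothetical base bit; one bit per timer */

struct sketch_tcpcb {
	unsigned int	flags;	/* armed bits, one per timer */
	int		fired;	/* how often a handler really ran */
};

/* Arm a timer: remember that its handler is expected to run. */
void
sketch_arm(struct sketch_tcpcb *tp, unsigned int timer)
{
	tp->flags |= SKETCH_TF_TIMER << timer;
}

/* Cancel a timer: a handler that is already pending must do nothing. */
void
sketch_cancel(struct sketch_tcpcb *tp, unsigned int timer)
{
	tp->flags &= ~(SKETCH_TF_TIMER << timer);
}

/*
 * Handler body, possibly running late after sleeping on a lock.
 * Returns 1 if the timer was still armed and the work was done.
 */
int
sketch_fire(struct sketch_tcpcb *tp, unsigned int timer)
{
	if ((tp->flags & (SKETCH_TF_TIMER << timer)) == 0)
		return 0;	/* canceled while the handler was waiting */
	tp->flags &= ~(SKETCH_TF_TIMER << timer);
	tp->fired++;
	return 1;
}
```

The point of the flag is that cancellation no longer depends on removing the timeout in time: a handler that slept and wakes up after the cancel sees the cleared bit and returns.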


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent with the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows running isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pick up the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to look up the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
interface in addition to enc0 is required per rdomain, e.g. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

Sockets created via a listening socket lose the rdomain and therefore
fail to work. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to an
alternate routing table and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; still missing are pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
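
As an illustration of why the idiom matters, here is a minimal sketch
with a made-up two-statement macro. COUNT_AND_CLEAR and count_if_set
are hypothetical names and are not taken from tcp_var.h. Wrapped in
do { } while (0), the macro expands to exactly one statement, so it can
be followed by a semicolon and still sit in an unbraced if/else.

```c
/*
 * Hypothetical macro, not from tcp_var.h. Without the do/while wrapper
 * the second statement would escape the "if" below, and the trailing
 * semicolon would detach the "else" and break the build.
 */
#define COUNT_AND_CLEAR(counter, flag) do {     \
        (counter)++;                            \
        (flag) = 0;                             \
} while (/* CONSTCOND */ 0)

/* Returns counter + 1 and clears *flag if it was set, otherwise -1. */
static int
count_if_set(int counter, int *flag)
{
        if (*flag)
                COUNT_AND_CLEAR(counter, *flag);
        else
                counter = -1;   /* reached only when *flag was 0 */
        return counter;
}
```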


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts time out and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
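
The fixed-point convention behind items 3 and 4 can be sketched as
follows. This is only an illustration under stated assumptions: the
5 fractional bits come from item 4, while SRTT_FRAC_BITS, srtt_update(),
srtt_to_ticks() and the textbook gain of 1/8 are hypothetical choices
and not the code in tcp_xmit_timer().

```c
#include <stdint.h>

#define SRTT_FRAC_BITS 5        /* assumed: 5 bits of fraction (item 4) */

/* Fold one RTT sample, given in whole ticks, into the scaled average. */
static int32_t
srtt_update(int32_t srtt_fp, int32_t sample_ticks)
{
        /* Scale the sample up to the same fixed-point format first. */
        int32_t sample_fp = sample_ticks * (1 << SRTT_FRAC_BITS);

        /* Move 1/8 of the way from the old average towards the sample. */
        return srtt_fp + (sample_fp - srtt_fp) / 8;
}

/* Convert the scaled value back to whole ticks. */
static int32_t
srtt_to_ticks(int32_t srtt_fp)
{
        return srtt_fp / (1 << SRTT_FRAC_BITS);
}
```

Reading the scaled value without the conversion would overstate the
RTT by a factor of 32 in this sketch, which is presumably the kind of
mistake that "missing shifts" in item 3 refers to.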


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.
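
A port bitmask of this kind can be sketched as below. The names
(portmask, port_set, port_isset) and the range of 1024 ports are
assumptions made for the illustration; the kernel's own macros and
sizes are not quoted in this log.

```c
#include <stdint.h>

#define NPORTS     1024         /* assumed size of the covered port range */
#define WORD_BITS  32

typedef struct {
        uint32_t bits[NPORTS / WORD_BITS];      /* one bit per port */
} portmask;

/* Mark a port as "do not allocate dynamically"; port must be < NPORTS. */
static void
port_set(portmask *m, unsigned int port)
{
        m->bits[port / WORD_BITS] |= (uint32_t)1 << (port % WORD_BITS);
}

/* Return 1 if the port is marked, 0 otherwise; port must be < NPORTS. */
static int
port_isset(const portmask *m, unsigned int port)
{
        return (m->bits[port / WORD_BITS] >> (port % WORD_BITS)) & 1;
}
```

A bitmask keeps the whole list in 128 bytes here and makes the check a
single shift and mask, which suits a table consulted on every dynamic
port allocation.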


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with
a structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.137 23-Jan-2022 bluhm

Define all TCP TF_ flags as unsigned numbers. They are stored in
u_int t_flags. Shifting TF_TIMER with TCPT_DELACK can touch the
sign bit.
found by kubsan; suggested by deraadt@; OK miod@
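
The hazard can be sketched as below, assuming TF_TIMER is 0x04000000
and TCPT_DELACK is 5; both values are assumptions for the illustration
and are not quoted in this log. With a plain int constant,
0x04000000 << 5 would need bit 31, which is the sign bit of a 32-bit
int, and that shift is undefined behaviour in C. With the U suffix the
same shift is well defined.

```c
#define TF_TIMER     0x04000000U  /* assumed value; the U suffix is the fix */
#define TCPT_DELACK  5            /* assumed index of the delayed ACK timer */

/* Flag bit that records whether the given timer is armed. */
static unsigned int
timer_flag(int timer)
{
        /* Unsigned shift: reaching bit 31 is fine for an unsigned int. */
        return TF_TIMER << timer;
}
```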


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delayed ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
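
The per-timer flag can be sketched like this. It is a simplified
userland illustration, not the kernel code: struct conn, timer_arm(),
timer_cancel() and timer_fire() are hypothetical names, and locking is
left out.

```c
#include <stdbool.h>

struct conn {
        unsigned int armed;     /* one bit per timer */
        int fired;              /* how many handler bodies actually ran */
};

static void
timer_arm(struct conn *c, int t)
{
        c->armed |= 1U << t;
}

static void
timer_cancel(struct conn *c, int t)
{
        c->armed &= ~(1U << t);
}

/*
 * Called when the timeout fires, possibly after sleeping for a lock.
 * Returns false if the timer was canceled in the meantime.
 */
static bool
timer_fire(struct conn *c, int t)
{
        if ((c->armed & (1U << t)) == 0)
                return false;   /* canceled while we were waiting */
        timer_cancel(c, t);     /* one shot: disarm before doing the work */
        c->fired++;
        return true;
}
```

The point of the flag is that the handler decides for itself whether it
is still wanted, instead of relying on the timeout having been deleted
in time.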


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows simplifying code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have fewer different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow adjusting tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent with the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
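
The role switch described above can be sketched as follows. This is a
simplified model under stated assumptions, not the code in tcp_input.c:
struct syn_set, sets[], active and syn_set_for_insert() are hypothetical
names, the hash buckets are omitted, and the caller is expected to pass
a fresh random value (for example from arc4random()) as new_seed.

```c
#include <stdint.h>

#define USE_LIMIT 100000        /* uses per set, from the commit message */

struct syn_set {
        int      count;         /* entries currently stored in this set */
        long     use;           /* uses left before a switch is wanted */
        uint32_t seed;          /* seed of this set's hash function */
};

static struct syn_set sets[2];
static int active;              /* index of the set that gets new entries */

/* Pick the set for a new entry, switching roles first when allowed. */
static struct syn_set *
syn_set_for_insert(uint32_t new_seed)
{
        struct syn_set *act = &sets[active];
        struct syn_set *pas = &sets[!active];

        /* Switch only when the active set is used up and the passive
         * one has idled out completely. */
        if (act->use <= 0 && pas->count == 0) {
                pas->seed = new_seed;   /* reseed the empty set */
                pas->use = USE_LIMIT;
                active = !active;
                act = pas;
        }
        act->use--;
        act->count++;
        return act;
}
```

In this model an attacker who keeps entries alive in one set can no
longer block reseeding forever, because new entries move to the other
set once the use limit is reached and the old set is left to drain.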


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows writing relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similarly
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows running isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pick up the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to look up the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
interface in addition to enc0 is required per rdomain, e.g. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

Sockets created via a listening socket lose the rdomain and therefore
fail to work. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to an
alternate routing table and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; still missing are pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.136 28-Jan-2021 visa

Drop tcp_trace() from SMALL_KERNEL builds to make room on amd64 floppy

OK deraadt@


Revision tags: OPENBSD_6_8_BASE
# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok deraadt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
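
The per-timer flag described above can be pictured with the following minimal sketch. It is an illustration under assumptions, not the kernel code: the struct and function names are hypothetical, and the real implementation works with timeout(9) and the net lock, which are left out here. Only the re-check of the flag inside the handler is modeled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for one TCP timer. */
struct sketch_timer {
	bool armed;	/* set when armed, cleared on cancel or fire */
	int fired;	/* how many times the handler did real work */
};

static void
sketch_timer_arm(struct sketch_timer *t)
{
	t->armed = true;
}

static void
sketch_timer_cancel(struct sketch_timer *t)
{
	/* Deleting the pending timeout alone would not be enough. */
	t->armed = false;
}

/*
 * Timer handler. In the scenario from the commit message the handler
 * may already be running and sleeping on the lock when the timer is
 * canceled, so it checks the flag again before doing any work.
 */
static void
sketch_timer_handler(struct sketch_timer *t)
{
	if (!t->armed)
		return;		/* canceled while we were waiting */
	t->armed = false;
	t->fired++;
}
```

A handler that runs after sketch_timer_cancel() therefore returns without touching the connection state.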


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
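
The role switch described above can be sketched as follows. This is a simplified model under assumptions, not the code in tcp_input.c: the struct layout and function names are hypothetical, the new seed is passed in by the caller instead of coming from the kernel's random source, and only the counters are tracked. The limit of 100000 uses is taken from the commit message.

```c
#include <stdint.h>

#define USE_LIMIT 100000L	/* uses before a switch is wanted */

/* Hypothetical, simplified stand-in for one syn cache set. */
struct cache_set {
	int count;		/* entries currently held */
	long use;		/* insertions left before a switch */
	uint32_t seed;		/* seed of the hash function */
};

struct cache_pair {
	struct cache_set set[2];
	int active;		/* index of the set taking new entries */
};

static void
cache_pair_init(struct cache_pair *cp, uint32_t seed)
{
	cp->set[0] = (struct cache_set){ 0, USE_LIMIT, seed };
	cp->set[1] = (struct cache_set){ 0, USE_LIMIT, 0 };
	cp->active = 0;
}

/*
 * Account for one new entry. If the active set has used up its budget
 * and the passive set has drained, the sets switch roles and the new
 * active set is reseeded. Returns the index that received the entry.
 */
static int
cache_pair_insert(struct cache_pair *cp, uint32_t new_seed)
{
	struct cache_set *act = &cp->set[cp->active];
	struct cache_set *pas = &cp->set[!cp->active];

	if (act->use <= 0 && pas->count == 0) {
		cp->active = !cp->active;
		act = pas;
		act->seed = new_seed;
		act->use = USE_LIMIT;
	}
	act->count++;
	act->use--;
	return cp->active;
}

/* An entry in set idx completed its handshake or timed out. */
static void
cache_pair_remove(struct cache_pair *cp, int idx)
{
	if (cp->set[idx].count > 0)
		cp->set[idx].count--;
}
```

Because new entries only go to the active set, the passive set can only shrink, so an attacker sending periodic SYNs can no longer keep it from becoming empty.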


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but document) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to an
alternate routing table and separating them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.135 18-Aug-2020 gnezdo

Convert tcp_sysctl to sysctl_bounded_args

This introduces bounds checks for many net.inet.tcp sysctl variables.
Folded some fitting cases into the framework: tcp_do_sack, tcp_do_ecn.

ok derradt@


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Test by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.134 12-Jul-2019 bluhm

Count the number of TCP SACK options that were dropped due to the
sack hole list length or pool limit.
OK claudio@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delay ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Test by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.
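
A port exclusion list kept as a bitmask can be sketched roughly as below. This is only an illustration of the data structure, assuming one bit per port over the full 16-bit range; `struct sk_portmask`, `sk_port_mark()` and `sk_port_ismarked()` are hypothetical names and are not taken from the kernel source.

```c
#include <stdint.h>

#define SK_NPORTS	65536	/* assumed: full 16-bit port range */
#define SK_WORDBITS	32

/* One bit per port; hypothetical layout, not the kernel's. */
struct sk_portmask {
	uint32_t bits[SK_NPORTS / SK_WORDBITS];
};

/* Mark a port as "do not allocate dynamically". */
static void
sk_port_mark(struct sk_portmask *pm, unsigned int port)
{
	if (port < SK_NPORTS)
		pm->bits[port / SK_WORDBITS] |=
		    UINT32_C(1) << (port % SK_WORDBITS);
}

/* Return 1 if the port is marked, 0 otherwise. */
static int
sk_port_ismarked(const struct sk_portmask *pm, unsigned int port)
{
	if (port >= SK_NPORTS)
		return 0;
	return (pm->bits[port / SK_WORDBITS] >> (port % SK_WORDBITS)) & 1;
}
```

A dynamic port allocator would call the lookup for each candidate port and skip the marked ones; the lookup is a single shift and mask regardless of how many ports are excluded.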


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.133 11-Jun-2018 bluhm

The output from tcp debug sockets was incomplete. After detach tp
was NULL and nothing was traced. So save the old tcpcb and use
that to retrieve some information. Note that otb may be freed and
must not be dereferenced. Use a heuristic for cases where the
address family is in the IP header but not provided in the PCB.
OK visa@


# 1.132 08-May-2018 bluhm

Historically there were slow and fast tcp timeouts. That is why
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@


Revision tags: OPENBSD_6_3_BASE
# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delayed ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@
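
The per-timer flag described in this entry can be sketched roughly as follows. This is a minimal user-space illustration and not the kernel code: `struct sk_timers`, `sk_arm()`, `sk_cancel()` and `sk_fire()` are hypothetical names, and a plain bit mask stands in for whatever flag storage the real tcpcb uses.

```c
/* Hypothetical stand-in for the timer state of one connection. */
struct sk_timers {
	unsigned int armed;	/* one bit per timer, set while armed */
	int ran[4];		/* how often each handler body really ran */
};

/* Arm timer `which'; the kernel would also schedule a timeout here. */
static void
sk_arm(struct sk_timers *t, int which)
{
	t->armed |= 1U << which;
}

/* Cancel timer `which'; deleting the timeout alone may come too late. */
static void
sk_cancel(struct sk_timers *t, int which)
{
	t->armed &= ~(1U << which);
}

/*
 * Timeout handler.  It may have slept waiting for a lock, so it
 * rechecks the flag and returns early if the timer was canceled.
 */
static int
sk_fire(struct sk_timers *t, int which)
{
	if ((t->armed & (1U << which)) == 0)
		return 0;		/* canceled in the meantime */
	t->armed &= ~(1U << which);	/* one-shot: disarm before work */
	t->ran[which]++;
	return 1;
}
```

The point of the sketch is that the handler itself decides whether it is still wanted, so a handler that was already dequeued when the cancel happened does no work.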


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows simplifying code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have fewer different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow adjusting tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent with the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@
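
The role switch between the two cache sets can be sketched as below. This is a reduced illustration under stated assumptions, not the code from tcp_input.c: `struct sk_set`, `struct sk_cache` and `sk_insert()` are hypothetical names, the caller supplies the new random seed, and entry storage and hashing are left out entirely.

```c
#define SK_USE_LIMIT	100000	/* uses before a reseed, per the log */

/* Hypothetical, much reduced stand-in for one SYN cache set. */
struct sk_set {
	int count;		/* entries currently stored */
	int use;		/* insertions left before a switch is due */
	unsigned int seed;	/* hash seed for this set */
};

struct sk_cache {
	struct sk_set set[2];
	int active;		/* index of the set receiving new entries */
};

/*
 * Called when a new entry is inserted.  If the active set is used up
 * and the passive set has drained, the sets switch roles and the new
 * active set gets a fresh seed.  Returns the set the entry goes into.
 */
static struct sk_set *
sk_insert(struct sk_cache *c, unsigned int newseed)
{
	struct sk_set *act = &c->set[c->active];
	struct sk_set *pas = &c->set[!c->active];

	if (act->use <= 0 && pas->count == 0) {
		c->active = !c->active;
		act = pas;		/* old passive set takes over */
		act->seed = newseed;
		act->use = SK_USE_LIMIT;
	}
	act->use--;
	act->count++;
	return act;
}
```

Because new entries only ever go into the active set, the passive set can only shrink, so an attacker who keeps sending SYNs cannot keep it from becoming empty.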


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows writing relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similarly
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows running isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pickup the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to lookup the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
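
For readers unfamiliar with the idiom, the following sketch shows why multi-statement macros are wrapped this way. `SK_RESET()` and `sk_use()` are hypothetical and only serve the illustration; they are not macros or functions from tcp_var.h.

```c
/*
 * Hypothetical two-statement macro, wrapped so that it behaves like a
 * single statement.  Without the wrapper, only the first statement
 * would be governed by an unbraced `if', and a following `else'
 * would not compile.
 */
#define SK_RESET(a, b) do {						\
	(a) = 0;							\
	(b) = 0;							\
} while (/* CONSTCOND */ 0)

/* Resets both values only when `cond' is true; returns their sum. */
static int
sk_use(int cond, int a, int b)
{
	if (cond)
		SK_RESET(a, b);
	else
		a++;
	return a + b;
}
```

The CONSTCOND comment is a lint annotation that marks the constant loop condition as intentional.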


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
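
The scaled representation mentioned in item 4 can be illustrated with a small fixed-point sketch. Only the 5 fraction bits come from the log entry; the names `SK_SRTT_FRAC`, `sk_to_fixed()`, `sk_to_ticks()` and `sk_smooth()` are hypothetical, and the smoothing gain of 1/8 is the classic textbook value, assumed here rather than read from the code.

```c
#define SK_SRTT_FRAC	5	/* fraction bits of the smoothed RTT */

/* Convert whole ticks to the scaled representation. */
static int
sk_to_fixed(int ticks)
{
	return ticks << SK_SRTT_FRAC;
}

/* Convert a scaled value back to whole ticks, rounding down. */
static int
sk_to_ticks(int fixed)
{
	return fixed >> SK_SRTT_FRAC;
}

/*
 * Move the smoothed RTT one eighth of the way toward a new sample.
 * `srtt' is scaled, `sample_ticks' is in whole ticks.
 */
static int
sk_smooth(int srtt, int sample_ticks)
{
	int delta = sk_to_fixed(sample_ticks) - srtt;

	return srtt + delta / 8;
}
```

Keeping fraction bits lets the integer arithmetic track changes smaller than one tick, and it is also why a value used without the matching shift is off by a constant factor.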


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.131 07-Feb-2018 bluhm

Historically TCP timeouts were implemented with pr_slowtimo and
pr_fasttimo. That is the reason why we have two timeout mechanisms
with complicated ticks calculation. Move the delayed ACK timeout to
milliseconds and remove some ticks and hz mess from the others.
This makes it easier to see the actual values.
OK florian@ dhill@ dlg@


# 1.130 06-Feb-2018 bluhm

There was a race in the TCP timers. As they may sleep to grab the
netlock, timers may still run after they have been disarmed. Deleting
the timeout is not sufficient to cancel them, but the code from 4.4
BSD is assuming this.
The solution is to add a flag for every timer to see whether it has
been armed or canceled. Remove the TF_DEAD check as tcp_canceltimers()
is called before the reaper timer is fired. Cancelation works
reliably now.
OK mpi@


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still implemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave the way for more fine-grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that assumes, when something is lost, that all
packets not yet SACKed up to the most forward SACK are lost. It may be a
correct estimate if the network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows simplifying code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have fewer different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting an RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow adjusting tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent with the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random data.
tedu@ agrees; OK mpi@
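
The active/passive scheme from the entry above can be sketched in userland
C as follows. This is only an illustration under stated assumptions: the
sketch_* names and struct layout are hypothetical, hashing itself is left
out, and the caller supplies fresh_seed where the kernel would draw new
random data. Only the 100000 figure comes from the log entry.

```c
#define SKETCH_USE_LIMIT 100000	/* figure quoted in the log entry */

/* Hypothetical stand-ins; the kernel's real structures differ. */
struct sketch_set {
	int count;		/* entries currently stored in this set */
	int uses_left;		/* insertions left before a swap is wanted */
	unsigned int seed;	/* seed of this set's hash function */
};

struct sketch_cache {
	struct sketch_set set[2];
	int active;		/* index of the set receiving new entries */
};

static void
sketch_init(struct sketch_cache *c, unsigned int seed)
{
	c->set[0].count = c->set[1].count = 0;
	c->set[0].uses_left = c->set[1].uses_left = SKETCH_USE_LIMIT;
	c->set[0].seed = c->set[1].seed = seed;
	c->active = 0;
}

/* Record one new entry; fresh_seed stands in for new random data. */
static void
sketch_insert(struct sketch_cache *c, unsigned int fresh_seed)
{
	struct sketch_set *act = &c->set[c->active];
	struct sketch_set *pas = &c->set[!c->active];

	/* Swap only if the active set is used up and the passive one is empty. */
	if (act->uses_left <= 0 && pas->count == 0) {
		c->active = !c->active;
		act = &c->set[c->active];
		act->seed = fresh_seed;
		act->uses_left = SKETCH_USE_LIMIT;
	}
	act->count++;
	act->uses_left--;
}

/* An entry of the given set completed its handshake or timed out. */
static void
sketch_remove(struct sketch_cache *c, int idx)
{
	if (c->set[idx].count > 0)
		c->set[idx].count--;
}
```

Because new entries only ever go into the active set, an attacker who
keeps sending SYNs cannot stop the passive set from draining, so a
reseed eventually happens.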


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declaration from sources files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows writing relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similarly
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows running isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pick up the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to look up the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows binding interfaces to
an alternate routing table and separating them from other interfaces in
distinct routing tables. The same network can now be used in any domain
at the same time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered; pf and IPv6 are still missing.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
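
For readers unfamiliar with the idiom named in the entry above, here is a
minimal sketch. SKETCH_RESET and sketch_use are hypothetical and not taken
from tcp_var.h; they only show why a multi-statement macro is wrapped this
way.

```c
/* Hypothetical two-statement macro, not one from tcp_var.h. */
#define SKETCH_RESET(a, b) do {				\
	(a) = 0;					\
	(b) = 0;					\
} while (/* CONSTCOND */ 0)

static int
sketch_use(int flag)
{
	int x = 1, y = 2;

	/*
	 * The macro expands to a single statement, so it is safe in an
	 * unbraced if/else and still takes a trailing semicolon.
	 */
	if (flag)
		SKETCH_RESET(x, y);
	else
		x = 10;

	return x + y;
}
```

With plain braces instead of do/while, the semicolon after the macro call
would end the if statement and the else would no longer compile. The
CONSTCOND comment is a lint hint that the constant condition is intended.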


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.
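
A per-second limiter of the kind described in the entry above might look
like the following userland sketch. It is an assumption-laden
illustration, not the kernel's code: struct sketch_pps and
sketch_pps_check are hypothetical names, time is passed in as whole
seconds, and the sketch allows at most `limit` packets per second.

```c
/* Hypothetical limiter state; not the kernel's data structure. */
struct sketch_pps {
	long window_start;	/* second in which counting began */
	int count;		/* packets allowed so far in that second */
};

/* Return 1 if a packet may be sent at time `now` (whole seconds), else 0. */
static int
sketch_pps_check(struct sketch_pps *p, long now, int limit)
{
	if (now != p->window_start) {
		/* A new second has started: open a fresh window. */
		p->window_start = now;
		p->count = 0;
	}
	if (p->count >= limit)
		return 0;	/* over the limit, drop the RST */
	p->count++;
	return 1;
}
```

Initialise window_start to a value that cannot match the first timestamp,
for example -1, so the first call opens a window.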


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.129 23-Jan-2018 bluhm

The TCP reaper timeout was still imlemented as soft timeout. So
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intension when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that require a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent to the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunatly the attacker can prevent the reseeding by sending
unanswered SYN packes periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but document) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pick up the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to look up the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@
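
For readers unfamiliar with the idiom, here is a minimal sketch of why a
multi-statement macro is wrapped this way. The macro, struct, and function
names are hypothetical and do not come from tcp_var.h.

```c
/* Hypothetical counter type used only for this illustration. */
struct counter {
	int calls;
	int total;
	int skipped;
};

/*
 * Two statements wrapped so that the macro expands to exactly one
 * statement. The CONSTCOND comment tells lint the constant condition
 * is intentional.
 */
#define COUNTER_BUMP(c, n) do {					\
	(c)->calls++;						\
	(c)->total += (n);					\
} while (/* CONSTCOND */ 0)

/*
 * With a bare { } block instead of do/while, the trailing semicolon
 * after the macro call would end the if statement and the else below
 * would not compile.
 */
static void
record(struct counter *c, int n)
{
	if (n > 0)
		COUNTER_BUMP(c, n);
	else
		c->skipped++;
}
```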


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts timeout and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and do the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun
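
As a reminder of what RFC 3390 specifies, the upper bound on the initial
window is min(4*MSS, max(2*MSS, 4380 bytes)). A small sketch of that formula
follows; the function name is hypothetical and this is not the kernel's own
code.

```c
/*
 * Upper bound on the initial congestion window per RFC 3390:
 *   min(4 * MSS, max(2 * MSS, 4380 bytes))
 * The function name is made up for this illustration.
 */
static unsigned int
rfc3390_initial_window(unsigned int mss)
{
	unsigned int lower = 2 * mss > 4380 ? 2 * mss : 4380;
	unsigned int upper = 4 * mss;

	return upper < lower ? upper : lower;
}
```

For a typical Ethernet MSS of 1460 bytes this gives 4380 bytes, that is,
three full segments.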


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups, the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug, when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(), the fix is adapted from
FreeBSD, the effective mss is recorded after option negotiation in 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlayed with
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.128 02-Nov-2017 florian

Move PRU_DETACH out of pr_usrreq into per proto pr_detach
functions to pave way for more fine grained locking.

Suggested by, comments & OK mpi


# 1.127 25-Oct-2017 job

Remove the TCP_FACK option and associated #if{,n}def code.

TCP_FACK was disabled by provos@ in June 1999.
TCP_FACK is an algorithm that decides that when something is lost, all
not SACKed packets until the most forward SACK are lost. It may be a
correct estimate, if network does not reorder packets.

OK visa@ mpi@ mikeb@


# 1.126 24-Oct-2017 mikeb

Refactor handling of partial TCP acknowledgements

With input from Klemens Nanni, OK visa, mpi, bluhm


# 1.125 22-Oct-2017 mikeb

Unconditionally enable TCP selective acknowledgements (SACK)

OK deraadt, mpi, visa, job


Revision tags: OPENBSD_6_2_BASE
# 1.124 14-Apr-2017 bluhm

Pass down the address family through the pr_input calls. This
allows to simplify code used for both IPv4 and IPv6.
OK mikeb@ deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.123 13-Mar-2017 claudio

Move PRU_ATTACH out of the pr_usrreq functions into pr_attach.
Attach is quite a different thing to the other PRU functions and
this should make locking a bit simpler. This also removes the ugly
hack on how proto was passed to the attach function.
OK bluhm@ and mpi@ on a previous version


# 1.122 09-Feb-2017 jca

percpu counters for TCP stats

ok mpi@ bluhm@


# 1.121 01-Feb-2017 dhill

In sogetopt, preallocate an mbuf to avoid using sleeping mallocs with
the netlock held. This also changes the prototypes of the *ctloutput
functions to take an mbuf instead of an mbuf pointer.

help, guidance from bluhm@ and mpi@
ok bluhm@


# 1.120 29-Jan-2017 bluhm

Change the IPv4 pr_input function to the way IPv6 is implemented,
to get rid of struct ip6protosw and some wrapper functions. It is
more consistent to have less different structures. The divert_input
functions cannot be called anyway, so remove them.
OK visa@ mpi@


# 1.119 26-Jan-2017 bluhm

Reduce the difference between struct protosw and ip6protosw. The
IPv4 pr_ctlinput functions did return a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@


# 1.118 25-Jan-2017 bluhm

Since raw_input() and route_input() are gone from pr_input, we can
make the variable parameters of the protocol input functions fixed.
Also add the proto to make it similar to IPv6.
OK mpi@ guenther@ millert@


# 1.117 16-Nov-2016 mpi

Kill recursive splsoftnet()s.

While here keep local definitions local.

ok bluhm@


# 1.116 04-Oct-2016 mpi

Convert timeouts that need a process context to timeout_set_proc(9).

The current reason is that rtalloc_mpath(9) inside ip_output() might
end up inserting a RTF_CLONED route and that requires a write lock.

ok kettenis@, bluhm@


Revision tags: OPENBSD_6_0_BASE
# 1.115 20-Jul-2016 bluhm

To tune the TCP SYN cache we need more information. Print the
relevant counters with netstat -s -p tcp.
OK henning@


# 1.114 20-Jul-2016 bluhm

Make the size for the syn cache hash array tunable. As we are
swapping between two syn caches for random reseeding anyway, this
feature can be added easily. When the cache is empty, there is an
opportunity to change the hash size. This allows an admin under
SYN flood attack to defend his machine.
Suggested by claudio@; OK jung@ claudio@ jmc@


# 1.113 18-Jun-2016 vgross

Add net.inet.{tcp,udp}.rootonly sysctl, to mark which ports
cannot be bound to by non-root users.

Ok millert@ bluhm@


# 1.112 29-Mar-2016 bluhm

Allow to adjust tcp_syn_use_limit with sysctl net.inet.tcp.synuselimit.
This is convenient to test the feature and may be useful to defend
against syn flooding in a denial of service condition. It is
consistent with the existing syn cache sysctls. Move some declarations
to tcp_var.h to access the syn cache sets from tcp_sysctl().
OK mpi@


# 1.111 27-Mar-2016 bluhm

To prevent attacks on the hash buckets of the syn cache, our TCP
stack reseeds the hash function every time the cache is empty.
Unfortunately the attacker can prevent the reseeding by sending
unanswered SYN packets periodically.
Fix this by having an active syn cache that gets new entries and a
passive one that is idling out. When the passive one is empty and
the active one has been used 100000 times, they switch roles and
the hash function is reseeded with new random.
tedu@ agrees; OK mpi@


# 1.110 21-Mar-2016 bluhm

Add a tcps_sc_seedrandom counter in TCP SYN cache and netstat -s.
This shows how often the hash function is reseeded and the random
bucket distribution changes.
OK mpi@ claudio@


Revision tags: OPENBSD_5_9_BASE
# 1.109 27-Aug-2015 bluhm

The syn cache is completely implemented in tcp_input.c. So all its
global variables should also live there.
OK markus@


# 1.108 24-Aug-2015 bluhm

Rename the syn cache counter into tcp_syn_cache_count to have the
same prefix for all variables. Convert the counter type to int,
the limit is also int. Before searching the cache, check that it
is not empty. Do not access the counter outside of the syn cache
from tcp_ctlinput(), let the syn_cache_lookup() function handle it.
OK dlg@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.107 08-Feb-2015 yasuoka

Count dropped SYN packets on the tcpstat. They are dropped due to the
listen queue (backlog) limit or the memory shortage in syn-cache.

ok henning reyk claudio


# 1.106 21-Jan-2015 deraadt

To satisfy kernel grovellers and bad (but documented) sysctl
practice, be pragmatic and #include <sys/timeout.h> for
struct tcpcb (glorious namespace violation)
ok kettenis millert sthen


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.105 23-Jan-2014 henning

since the cksum rewrite the counters for hardware checksummed packets
are a lie, since the software engine emulates hardware offloading
and that is later indistinguishable. so kill the hw cksummed counters.
introduce software checksummed packet counters instead.
tcp/udp handles ip & ipvshit, ip cksum covered, 6 has no ip layer cksum.
as before we still have a miscounting bug for inbound with pf on, to be
fixed in the next step.
found by, prodding & ok naddy


# 1.104 23-Oct-2013 deraadt

remove historical #if 1


# 1.103 21-Oct-2013 phessler

Sprinkle a lot more IPv6 routing domains support in the kernel.

Mostly mechanical, setting and passing the rdomain and rtable correctly.
Not yet enabled.

Lots of help and hints from claudio and bluhm

OK claudio@, bluhm@


# 1.102 12-Aug-2013 bluhm

Add the TCP socket option TCP_NOPUSH to delay sending the stream.
This is useful to aggregate data in the kernel from multiple sources
like writes and socket splicing. It avoids sending small packets.
From FreeBSD via David Hill; OK mikeb@ henning@


Revision tags: OPENBSD_5_4_BASE
# 1.101 01-Jun-2013 bluhm

Pass the routing domain to IPv6 pr_ctlinput() like in IPv4.
OK claudio@


# 1.100 10-Apr-2013 mpi

Remove various external variable declarations from source files and
move them to the corresponding header with an appropriate comment if
necessary.

ok guenther@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.99 06-Jul-2011 sthen

Add sysctl net.inet.tcp.always_keepalive, when this is set the system
behaves as if SO_KEEPALIVE was set on all TCP sockets, forcing keepalives
to be sent every net.inet.tcp.keepidle half-seconds.

In conjunction with a keepidle value greatly reduced from the default,
this can be useful for keeping sessions open if you are stuck on a network
with short NAT or firewall timeouts.

Feedback from various people, ok henning@ claudio@


Revision tags: OPENBSD_4_9_BASE
# 1.98 07-Jan-2011 bluhm

Add socket option SO_SPLICE to splice together two TCP sockets.
The data received on the source socket will automatically be sent
on the drain socket. This allows to write relay daemons with zero
data copy.
ok markus@


# 1.97 21-Oct-2010 bluhm

There is no TCP6 in our kernel, so remove the #ifndef TCP6.
No binary change.
ok claudio@ henning@


# 1.96 24-Sep-2010 claudio

TCP send and recv buffer scaling.
Send buffer is scaled by not accounting unacknowledged on the wire
data against the buffer limit. Receive buffer scaling is done similar
to FreeBSD -- measure the delay * bandwidth product and base the
buffer on that. The problem is that our RTT measurement is coarse
so it overshoots on low delay links. This does not matter that much
since the recvbuffer is almost always empty.
Add a back pressure mechanism to control the amount of memory
assigned to socketbuffers that kicks in when 80% of the cluster
pool is used.
Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org.

Based on work by markus@ and djm@.

OK dlg@, henning@, put it in deraadt@


Revision tags: OPENBSD_4_8_BASE
# 1.95 09-Jul-2010 reyk

Add support for using IPsec in multiple rdomains.

This allows to run isakmpd/iked/ipsecctl in multiple rdomains
independently (with "route exec"); the kernel will pick up the rdomain
from the process context of the pfkey socket and load the flows and
SAs into the matching rdomain encap routing table. The network stack
also needs to pass the rdomain to the ipsec stack to look up the
correct rdomain that belongs to an interface/mbuf/... You can now run
individual IPsec configs per rdomain or create IPsec VPNs between
multiple rdomains on the same machine ;). Note that a primary enc(4)
in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1.

Tested by some people, mostly on existing "rdomain 0" setups. Was in
snaps for some days and people didn't complain.

ok claudio@ naddy@


# 1.94 03-Jul-2010 guenther

Fix the naming of interfaces and variables for rdomains and rtables
and make it possible to bind sockets (including listening sockets!)
to rtables and not just rdomains. This changes the name of the
system calls, socket option, and ioctl. After building with this
you should remove the files /usr/share/man/cat2/[gs]etrdomain.0.

Since this removes the existing [gs]etrdomain() system calls, the
libc major is bumped.

Written by claudio@, criticized^Wcritiqued by me


Revision tags: OPENBSD_4_7_BASE
# 1.93 13-Nov-2009 claudio

Extend the protosw pr_ctlinput function to include the rdomain. This is
needed so that the route and inp lookups done in TCP and UDP know where
to look. Additionally in_pcbnotifyall() and tcp_respond() got a rdomain
argument as well for similar reasons. With this tcp seems to be now
fully rdomain safe and no longer leaks single packets into the main domain.
Looks good markus@, henning@


# 1.92 10-Aug-2009 claudio

sockets created via a listening socket lose the rdomain and fail to work
therefore. Inherit the rdomain through the syncache.
There are some interactions that need some more work (ctlinput) so this
can be improved but is good enough for now.
OK markus@


Revision tags: OPENBSD_4_6_BASE
# 1.91 05-Jun-2009 claudio

Initial support for routing domains. This allows to bind interfaces to
alternate routing table and separate them from other interfaces in distinct
routing tables. The same network can now be used in any domain at the same
time without causing conflicts.
This diff is mostly mechanical and adds the necessary rdomain checks across
net and netinet. L2 and IPv4 are mostly covered still missing pf and IPv6.
input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@


Revision tags: OPENBSD_4_5_BASE
# 1.90 08-Nov-2008 dlg

fix macros up so they use the do { } while (/* CONSTCOND */ 0) idiom

ok deraadt@ otto@


Revision tags: OPENBSD_4_4_BASE
# 1.89 24-May-2008 thib

Remove {tcp/udp}6_usrreq(); Since the normal ones now
take a proc argument, there's no need for these, since
they are just wrappers.

OK claudio@


# 1.88 23-May-2008 thib

Deal with the situation when TCP nfs mounts time out and processes
get hung in nfs_reconnect() because they do not have the proper
privileges to bind to a socket, by adding a struct proc * argument
to sobind() (and the *_usrreq() routines, and finally in{6}_pcbbind)
and doing the sobind() with proc0 in nfs_connect.

OK markus@, blambert@.
"go ahead" deraadt@.

Fixes an issue reported by bernd@ (Tested by bernd@).
Fixes PR5135 too.


# 1.87 06-May-2008 markus

remove tcp_drain code since it's no longer used; ok henning, feedback thib


Revision tags: OPENBSD_4_3_BASE
# 1.86 20-Feb-2008 markus

remove old unused TCP isn code; ok henning, dhartmei, mcbride


# 1.85 20-Feb-2008 markus

when creating a response, use the correct TCP header instead of
relying on the mbuf chain layout; with claudio@ and krw@; ok henning@


# 1.84 13-Dec-2007 reyk

implement sysctls to report IP, TCP, UDP, and ICMP statistics and
change netstat to use them instead of accessing kvm for it. more
protocols will be added later.

discussed with deraadt@ claudio@ gilles@
ok deraadt@


Revision tags: OPENBSD_4_2_BASE
# 1.83 25-Jun-2007 markus

branches: 1.83.2;
merge tcp_set_iss() and tcp_set_tsm(); ok mcbride, djm (on earlier version)


# 1.82 15-Jun-2007 markus

Drop the current random timestamps and the current ISN generation
code and replace both with a RFC1948 based method, so TCP clients
now have monotonic ISN/timestamps. The server side uses completely
random ISN/timestamps and does time-wait recycling (on port reuse).
ok djm@, mcbride@; thanks to lots of testers


Revision tags: OPENBSD_4_1_BASE
# 1.81 01-Feb-2007 jmc

branches: 1.81.2;
correct rfc; from Kris Katterjohn


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.80 11-Dec-2005 deraadt

bitfields must be off an int or such type


# 1.79 20-Nov-2005 brad

splimp -> splvm. mbuf allocation here.

ok henning@


# 1.78 15-Nov-2005 miod

Only two `h' in threshold.


Revision tags: OPENBSD_3_8_BASE
# 1.77 02-Aug-2005 markus

change the TCP reass queue from LIST to TAILQ;
ok henning claudio fgsch krw


# 1.76 04-Jul-2005 markus

remove TUBA, ok many


# 1.75 30-Jun-2005 markus

implement PMTU checks from
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html
i.e. don't act on ICMP-need-frag immediately if adhoc checks on the
advertised mtu fail. the mtu update is delayed until a tcp retransmit
happens. initial patch by Fernando Gont, tested by many.


# 1.74 24-May-2005 fgont

Ignore ICMP Source Quench messages meant for TCP connections. (Details in
http://www.gont.com.ar/drafts/icmp-attacks-against-tcp.html)
ok markus frantzen


# 1.73 05-Apr-2005 markus

add tcp sack stats, similar to freebsd; ok deraadt


Revision tags: OPENBSD_3_7_BASE
# 1.72 09-Mar-2005 markus

from freebsd:
1. set rcv_laststart/rcv_lastend after checking the tcp window
2. pass rcv_laststart and rcv_lastend on the stack (shrink tcp state)
ok henning, djm


# 1.71 04-Mar-2005 markus

- check th_ack against snd_una/max; from Raja Mukerji via hugh@
- limit pool to tcp_sackhole_limit entries (sysctl-able)
- stop sack option processing on pool_get errors
- use SEQ_MIN/SEQ_MAX
ok henning, hshoexer, deraadt
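
As an illustration of the sequence-space comparisons mentioned in 1.71,
here is a sketch of wraparound-safe macros in the usual BSD style. The
macro bodies are written as an assumption of the common form, not copied
from the header, and ack_in_window() is a hypothetical helper showing a
check of th_ack against snd_una/snd_max, not the actual kernel code.

```c
#include <stdint.h>

/* Compare 32-bit sequence numbers modulo 2^32 via signed difference. */
#define SEQ_LT(a, b)	((int32_t)((uint32_t)(a) - (uint32_t)(b)) < 0)
#define SEQ_GT(a, b)	((int32_t)((uint32_t)(a) - (uint32_t)(b)) > 0)
#define SEQ_MIN(a, b)	(SEQ_LT(a, b) ? (a) : (b))
#define SEQ_MAX(a, b)	(SEQ_GT(a, b) ? (a) : (b))

/*
 * Hypothetical helper: accept an ACK only if it lies within
 * [snd_una, snd_max] in sequence space.
 */
static int
ack_in_window(uint32_t snd_una, uint32_t th_ack, uint32_t snd_max)
{
	return !SEQ_LT(th_ack, snd_una) && !SEQ_GT(th_ack, snd_max);
}
```

A plain "<" would give the wrong answer once the sequence number wraps
past 2^32; the signed difference keeps the ordering correct as long as
the two values are less than 2^31 apart.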


# 1.70 27-Feb-2005 markus

1. tcp_xmit_timer(): remove extra rtt decrement (t_rtttime is 0-based
while t_rtt was 1-based), update callers
2. define and use TCP_RTT_BASE_SHIFT instead of the hardcoded 2.
3. add missing shifts when t_srtt/t_rttvar are used.
4. update the comments: t_srtt uses 5 bits of fraction (not 3)
and t_rttvar uses 4 bits
5. remove obsolete/unused macros TCP_RTT_SCALE and TCP_RTTVAR_SCALE
6. make sure rttmin is not > TCPTV_REXMTMAX
parts from netbsd, ok mcbride, henning
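
The fixed-point representation described in item 4 of 1.70 can be
sketched as follows. This is not the tcp_xmit_timer() code: struct
rtt_est, rtt_update() and the SRTT_FRAC_BITS/RTTVAR_FRAC_BITS names are
assumptions made for illustration. Only the fraction widths (5 bits for
the smoothed RTT, 4 bits for the variance) are taken from the log entry;
the 1/8 and 1/4 gains are the classic Jacobson values.

```c
#include <stdint.h>
#include <stdlib.h>

#define SRTT_FRAC_BITS		5	/* fraction bits in smoothed RTT */
#define RTTVAR_FRAC_BITS	4	/* fraction bits in RTT variance */

struct rtt_est {
	int32_t srtt;	/* smoothed RTT, ticks scaled by 2^5; 0 = no sample */
	int32_t rttvar;	/* RTT variance, ticks scaled by 2^4 */
};

/* Feed one RTT sample, measured in whole ticks, into the estimator. */
static void
rtt_update(struct rtt_est *e, int32_t rtt)
{
	int32_t delta, err;

	if (e->srtt == 0) {
		/* First sample: srtt = rtt, rttvar = rtt / 2. */
		e->srtt = rtt * (1 << SRTT_FRAC_BITS);
		e->rttvar = rtt * (1 << RTTVAR_FRAC_BITS) / 2;
		return;
	}
	/* Error in srtt scale, then srtt += error / 8. */
	delta = rtt * (1 << SRTT_FRAC_BITS) - e->srtt;
	e->srtt += delta / 8;
	/* Rescale |error| to rttvar scale, then rttvar += (|err| - rttvar) / 4. */
	err = abs(delta) / (1 << (SRTT_FRAC_BITS - RTTVAR_FRAC_BITS));
	e->rttvar += (err - e->rttvar) / 4;
}

/* Smoothed RTT in whole ticks; this is the "missing shift" when reading. */
static int32_t
rtt_srtt_ticks(const struct rtt_est *e)
{
	return e->srtt >> SRTT_FRAC_BITS;
}
```

Forgetting the shift when reading srtt or rttvar yields a value that is
too large by the scale factor, which is the class of bug item 3 of the
entry refers to.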


# 1.69 10-Jan-2005 mcbride

Make sure bogus values don't make their way into tcp_xmit_timer() calculations.
- Ignore ts_ecr if it is 0, or the resulting rtt is out of range.
(use tp->t_rtttime instead)
- Initialise tcp_now to 1, to avoid the 500ms window where a valid ts_ecr
of 0 could be ignored.
- Convert out-of-range rtt values to valid ones in tcp_xmit_timer().

ok frantzen@ markus@


# 1.68 25-Nov-2004 markus

fix for race between invocation for timer and network input
1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
with mickey@; ok deraadt@


# 1.67 28-Oct-2004 mcbride

Modulate tcp_now by a random amount on a per-connection basis.

ok markus@ frantzen@


# 1.66 16-Sep-2004 markus

don't send partial segments if SS_ISSENDING is set, remember
TF_LASTIDLE across invocations of tcp_output (from freebsd);
ok mcbride


Revision tags: OPENBSD_3_6_BASE
# 1.65 15-Jul-2004 markus

branches: 1.65.2;
tcp_trace() expects short, not int; ok deraadt


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.64 08-Jun-2004 markus

factor out md5 code; ok+tests henning@, djm@, hshoexer@


# 1.63 25-Apr-2004 markus

add TCPCTL_DROP; ok deraadt, cedric, grange, ...


# 1.62 20-Apr-2004 markus

add tcps_rcvacktooold; ok deraadt


Revision tags: OPENBSD_3_5_BASE
# 1.61 02-Mar-2004 markus

branches: 1.61.2;
limit total number of queued out-of-order packets to NMBCLUSTERS/2; ok mcbride


# 1.60 27-Feb-2004 markus

implement tcp_drain() similar to ip_drain(); ok mcbride@


# 1.59 27-Feb-2004 markus

API change; counter for upcoming tcp_drain(); ok deraadt


# 1.58 15-Feb-2004 markus

switch to sysctl_int_arr(); ok itojun, henning, miod, deraadt


# 1.57 31-Jan-2004 markus

!sack_disable -> sack_enable; ok deraadt@


# 1.56 29-Jan-2004 markus

support for RFC3390 (Increasing TCP's Initial Window); ok deraadt, itojun
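
For reference, the initial window bound defined by RFC 3390 is
min(4*MSS, max(2*MSS, 4380 bytes)). A minimal sketch follows; the
function name rfc3390_initial_window() is hypothetical and this is not
the code added in 1.56.

```c
#include <stdint.h>

/* Upper bound on the initial congestion window, in bytes (RFC 3390). */
static uint32_t
rfc3390_initial_window(uint32_t mss)
{
	uint32_t lower = 2 * mss > 4380 ? 2 * mss : 4380;	/* max(2*MSS, 4380) */
	uint32_t upper = 4 * mss;				/* 4*MSS */

	return upper < lower ? upper : lower;			/* min of the two */
}
```

With an Ethernet-sized MSS of 1460 bytes this allows 4380 bytes, i.e.
three full segments, instead of the one or two segments permitted
before.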


# 1.55 14-Jan-2004 markus

syncache+ipv6 support for TCP_SIGNATURE; with itojun; ok deraadt


# 1.54 13-Jan-2004 markus

bring back the old TCP_SIGNATURE code from tcp_input.c rev 1.45
and make it compile (does not work yet); ok deraadt@


# 1.53 07-Jan-2004 markus

syn_XXX_limit -> synXXXlimit for consistency; ok deraadt


# 1.52 06-Jan-2004 markus

import netbsd's version of David Borman's syncache code
http://www.kohala.com/start/borman.97jun06.txt; ok deraadt@, henning@


Revision tags: OPENBSD_3_4_BASE
# 1.51 09-Jun-2003 itojun

branches: 1.51.2;
backout following:
>use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().

PR 3283 fixed (confirmed)


# 1.50 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


# 1.49 29-May-2003 itojun

use m_pulldown not m_pullup2. fix some bugs in IPv6 tcp_trace().


# 1.48 26-May-2003 itojun

fix tcpcb size to make trpt happy


# 1.47 23-May-2003 itojun

don't #ifdef within struct tcpcb definition, as it is used in userland too.
dhartmei ok


Revision tags: UBC_SYNC_A
# 1.46 12-May-2003 jason

Nuke a whole bunch of commons; ok tedu (still more to come *sigh*)


Revision tags: OPENBSD_3_3_BASE
# 1.45 12-Feb-2003 jason

branches: 1.45.2;
Remove commons; inspired by netbsd.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.44 09-Jun-2002 itojun

whitespace


# 1.43 16-May-2002 kjc

bring in ECN support from KAME.
it consists of
- ECN support in TCP
- tunnel-egress and fragment reassembly rules in layer-3 not to lose
congestion info at tunnel-egress and fragment reassembly

to enable ECN in TCP, build a kernel with TCP_ECN, and then,
turn it on by "sysctl -w net.inet.tcp.ecn=1".

ok deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 provos

use timeout(9) to schedule TCP timers. this avoids traversing all
tcp connections during tcp_slowtimo. adapted from thorpej@netbsd.org


# 1.40 02-Mar-2002 provos

disable immediate ack on TH_PUSH. make behaviour sysctl tuneable.
from netbsd; also fix a bug where setting TF_ACKNOW didn't actually
result in an ack.


# 1.39 01-Mar-2002 provos

remove tcp_fasttimo and convert delayed acks to the timeout(9) API instead.
adapted from netbsd. okay angelos@


# 1.38 15-Jan-2002 provos

allocate sackholes with pool


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.37 23-Jun-2001 angelos

branches: 1.37.4;
Keep stats on TCP/UDP hardware checksumming.


# 1.36 09-Jun-2001 angelos

Inclusion protection.


Revision tags: OPENBSD_2_9_BASE
# 1.35 13-Dec-2000 provos

more random tcp sequence numbers. okay deraadt@, angelos@


# 1.34 11-Dec-2000 itojun

nuke #ifdef TCP6 (no longer supported).
validate ICMPv6 too big messages (pmtud) based on pcb. we accept
certain amount of non-validated ones, as IPv6 mandates ICMPv6 (so even for
traffic from unconnected pcb, we need pmtud).
sync with kame


Revision tags: OPENBSD_2_8_BASE
# 1.33 14-Oct-2000 itojun

implement net.inet.tcp.rstppslimit. rate-limits outbound TCP RST traffic
to less than N per 1 second.


# 1.32 25-Sep-2000 provos

on expiry of pmtu route, retry higher mtu. okay angelos@


# 1.31 20-Sep-2000 provos

correctly calculate mss


# 1.30 18-Sep-2000 provos

Path MTU discovery based on NetBSD but with the decision to use the DF
flag delayed to ip_output(). That halves the code and reduces most of
the route lookups. okay deraadt@


# 1.29 11-Jul-2000 provos

compute correct window scale when recvpipe option is set in route; based
on diff from "Pete Kazmier" <pete@kazmier.com>


# 1.28 26-Jun-2000 art

Make the definition of tcpstat in tcp_var.h extern.


# 1.27 18-Jun-2000 beck

support ipv6 for tcp_ident


Revision tags: OPENBSD_2_7_BASE SMP_BASE
# 1.26 21-Dec-1999 provos

branches: 1.26.2;
option TCP_NEWRENO goes away, it's the default case for TCP_SACK if
SACK is disabled for the connection or via sysctl


Revision tags: kame_19991208
# 1.25 08-Dec-1999 itojun

bring in KAME IPv6 code, dated 19991208.
replaces NRL IPv6 layer. reuses NRL pcb layer. no IPsec-on-v6 support.
see sys/netinet6/{TODO,IMPLEMENTATION} for more details.

GENERIC configuration should work fine as before. GENERIC.v6 works fine
as well, but you'll need KAME userland tools to play with IPv6 (will be
brought in soon).


Revision tags: OPENBSD_2_6_BASE
# 1.24 06-Aug-1999 deraadt

back out all recent changes, which continue to be a source for nasty bugs


# 1.23 22-Jul-1999 niklas

Revert to 1.21


# 1.22 17-Jul-1999 provos

revert tcp_input.c to before 07/01/1999 - this seems to solve the mysterious
data corruptions and panics that people have experienced. by reverting
we lose tcp signatures and ipv6 cleanups; the code looked correct to me.


# 1.21 06-Jul-1999 cmetz

Added support for TCP MD5 option (RFC 2385).


# 1.20 02-Jul-1999 cmetz

Fixed a #ifdef defined()... typo that turned into a compilation failure.


Revision tags: OPENBSD_2_5_BASE
# 1.19 27-Mar-1999 provos

add SADB_X_BINDSA to pfkey allowing incoming SAs to refer to an outgoing
SA to be used, use this SA in ip_output if available. allow mobile road
warriors for bind SAs with wildcard dst and src addresses. check IPSEC
AUTH and ESP level when receiving packets, drop them if protection is
insufficient. add stats to show dropped packets because of insufficient
IPSEC protection. -- phew. this was all done in canada. dugsong and linh
provided the ride and company.


# 1.18 04-Feb-1999 deraadt

indent


# 1.17 04-Feb-1999 deraadt

use u_int32_t and u_int64_t for stats variables, instead of quad/long


# 1.16 11-Jan-1999 niklas

Make TCP_SACK compile with new netinet


# 1.15 11-Jan-1999 deraadt

netinet merge of NRL stuff. some indent and shrinkage needed; NRL/cmetz


# 1.14 18-Nov-1998 deraadt

indent right


# 1.13 17-Nov-1998 provos

NewReno, SACK and FACK support for TCP, adapted from code for BSDI
by Hari Balakrishnan (hari@lcs.mit.edu), Tom Henderson (tomh@cs.berkeley.edu)
and Venkat Padmanabhan (padmanab@cs.berkeley.edu) as part of the
Daedalus research group at the University of California,
(http://daedalus.cs.berkeley.edu). [I was able to do this on time spent
at the Center for Information Technology Integration (citi.umich.edu)]


# 1.12 28-Oct-1998 provos

- fix three bugs pointed out in Stevens, i.a. updating timestamps correctly
- fix a 4.4bsd-lite2 bug: when tcp options are present the maximum segment
size is not updated correctly, so that fast recovery forces out a segment
which is split in two segments by tcp_output(). the fix is adapted from
FreeBSD: the effective mss is recorded after option negotiation in the 3way
handshake.
[I was able to fix this on time spent at Center for Information Technology
Integration (citi.umich.edu)]


Revision tags: OPENBSD_2_4_BASE
# 1.11 10-Jun-1998 beck

New TCPCTL_IDENT sysctl for identd without kmem insanity.


Revision tags: OPENBSD_2_3_BASE
# 1.10 18-Mar-1998 angelos

Add FreeBSD patch (check for SYN packets arriving at a socket in
LISTEN state with source address/port == destination address/port).


# 1.9 24-Jan-1998 mickey

sysctl for def sizes for tcp/udp send/recv queues


Revision tags: OPENBSD_2_2_BASE
# 1.8 09-Aug-1997 millert

The list of tcp/udp ports not to allocate dynamically is now
a bitmask configurable via sysctl([38]). The default values
have not changed. If one wants to change the list it should
be done early on in /etc/rc.
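
A port bitmask of the kind described in 1.8 can be sketched as below:
one bit per port number, set when the port must not be handed out
dynamically. The names struct portmask, portmask_set(), portmask_isset()
and port_pick() are hypothetical, and the layout is an assumption rather
than the kernel's actual representation.

```c
#include <stdint.h>

#define PORT_COUNT	65536
#define PORT_WORDS	(PORT_COUNT / 32)

struct portmask {
	uint32_t bits[PORT_WORDS];	/* bit set = never allocate dynamically */
};

/* Mark a port as excluded from dynamic allocation. */
static void
portmask_set(struct portmask *m, uint16_t port)
{
	m->bits[port / 32] |= (uint32_t)1 << (port % 32);
}

/* Return non-zero if the port is excluded. */
static int
portmask_isset(const struct portmask *m, uint16_t port)
{
	return (m->bits[port / 32] >> (port % 32)) & 1;
}

/* Return the first non-excluded port in [first, last], or -1 if none. */
static int
port_pick(const struct portmask *m, uint16_t first, uint16_t last)
{
	unsigned int p;

	for (p = first; p <= last; p++)
		if (!portmask_isset(m, (uint16_t)p))
			return (int)p;
	return -1;
}
```

The whole mask occupies 8 KB and a lookup is a single shift and mask,
which is why a bitmask is a natural fit for a sysctl-configurable list.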


# 1.7 15-Jun-1997 deraadt

change byte counters to u_quad_t


# 1.6 06-Jun-1997 deraadt

add net.inet.tcp.{keepidle,keepintvl,slowhz}; mouse@Rodents.Montreal.QC.CA


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE
# 1.5 20-Sep-1996 deraadt

`solve' the syn bomb problem as well as currently known; add sysctl's for
SOMAXCONN (kern.somaxconn), SOMINCONN (kern.sominconn), and TCPTV_KEEP_INIT
(net.inet.tcp.keepinittime). when this is not enough (ie. overfull), start
doing tail drop, but slightly prefer the same port.


# 1.4 12-Sep-1996 tholo

TCP Persist handling; from 4.4BSD Lite2 (via NetBSD PR 2335)


# 1.3 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.2 14-Dec-1995 deraadt

from netbsd:
make netinet work on systems where pointers and longs are 64 bits
(like the alpha). Biggest problem: IP headers were overlaid with a
structure which included pointers, and which therefore didn't overlay
properly on 64-bit machines. Solution: instead of threading pointers
through IP header overlays, add a "queue element" structure to do
the threading, and point it at the ip headers.


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision