History log of /freebsd-current/sys/netinet6/ip6_output.c
Revision Date Author Comments
# 530c2c30 20-Mar-2024 Andrew Gallatin <gallatin@FreeBSD.org>

ip6_output: Reduce cache misses on pktopts

When profiling an IP6 heavy workload, I noticed that we were
getting a lot of cache misses in ip6_output() around
ip6_pktopts. This was happening because the TCP stack passes
inp->in6p_outputopts even if all options are unused. So in the
common case of no options present, pkt_opts is not null, and is
checked repeatedly for different options. Since ip6_pktopts is
large (4 cachelines), and every field is checked, we take 4
cache misses (2 of which tend to be hidden by the adjacent line
prefetcher).

To fix this common case, I introduced a new flag in ip6_pktopts
(ip6po_valid) which tracks which options have been set. In the
common case where nothing is set, this causes just a single
cache miss to load. It also eliminates a test for some options
(if (opt != NULL && opt->val >= const) vs if ((optvalid & flag) !=0 )

To keep the struct the same size in 64-bit kernels, and to keep
the integer values (like ip6po_hlim, ip6po_tclass, etc) on the
same cacheline, I moved them to the top.

As suggested by zlei, the null check in MAKE_EXTHDR() becomes
redundant, and can be removed.

For our web server workload (with the ip6po_tclass option set),
this drops the CPI from 2.9 to 2.4 for ip6_output

Differential Revision: https://reviews.freebsd.org/D44204
Reviewed by: bz, glebius, zlei
No Objection from: melifaro
Sponsored by: Netflix Inc.


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# e3ba0d6a 26-Jul-2023 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: do not copy so_options into inp_flags2

Since f71cb9f74808 socket stays connnected with inpcb through latter's
lifetime and there is no reason to complicate things and copy these
flags.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D41198


# bc310a95 20-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

ip output: ensure that mbufs are mapped if ipsec is enabled

Ipsec needs access to packet headers to determine if a policy is
applicable. It seems that typically IP headers are mapped, but the code
is arguably needs to check this before blindly accessing them. Then,
operations like m_unshare() and m_makespace() are not yet ready for
unmapped mbufs.

Ensure that the packet is mapped before calling into IPSEC_OUTPUT().

PR: 272616
Reviewed by: jhb, markj
Sponsored by: NVidia networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41112


# 317fa516 28-Feb-2023 Mark Johnston <markj@FreeBSD.org>

netinet: Remove the IP(V6)_RSS_LISTEN_BUCKET socket option

It has no effect, and an exp-run revealed that it is not in use.

PR: 261398 (exp-run)
Reviewed by: mjg, glebius
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38822


# 3aff4ccd 27-Feb-2023 Mark Johnston <markj@FreeBSD.org>

netinet: Remove IP(V6)_BINDMULTI

This option was added in commit 0a100a6f1ee5 but was never completed.
In particular, there is no logic to map flowids to different listening
sockets, so it accomplishes basically the same thing as SO_REUSEPORT.
Meanwhile, we've since added SO_REUSEPORT_LB, which at least tries to
balance among listening sockets using a hash of the 4-tuple and some
optional NUMA policy.

The option was never documented or completed, and an exp-run revealed
nothing using it in the ports tree. Moreover, it complicates the
already very complicated in_pcbbind_setup(), and the checking in
in_pcbbind_check_bindmulti() is insufficient. So, let's remove it.

PR: 261398 (exp-run)
Reviewed by: glebius
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38574


# 3d0d5b21 23-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Explicitly include <net/if_private.h> in netstack

Summary:
In preparation of making if_t completely opaque outside of the netstack,
explicitly include the header. <net/if_var.h> will stop including the
header in the future.

Sponsored by: Juniper Networks, Inc.
Reviewed by: glebius, melifaro
Differential Revision: https://reviews.freebsd.org/D38200


# 21cc0918 16-Aug-2021 Elliott Mitchell <ehem+freebsd@m5p.com>

sys: Nuke double-semicolons

A distinct number of double-semicolons have ended up in FreeBSD. Take a
pass at getting rid of many of these harmless typos.

Reviewed by: emaste, rrs
Pull Request: https://github.com/freebsd/freebsd-src/pull/609
Differential Revision: https://reviews.freebsd.org/D31716


# 2e0e2739 13-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netinet6: trim overly long lines in GET_PKTOPT_VAR(), fit into 80 chars


# 53af6903 06-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

tcp: remove INP_TIMEWAIT flag

Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193ab, this commit shall not cause any functional changes.

Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.

Differential revision: https://reviews.freebsd.org/D36400


# 46ddeb6b 03-Oct-2022 Gleb Smirnoff <glebius@FreeBSD.org>

netinet6: retire ip6protosw.h

The netinet/ipprotosw.h and netinet6/ip6protosw.h were KAME relics, with
the former removed in f0ffb944d25 in 2001 and the latter survived until
today. It has been reduced down to only one useful declaration that
moves to ip6_var.h

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36726


# dda6376b 08-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

net: employ newly added pfil_mbuf_{in,out} where approriate

Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D36454


# 74ed2e8a 02-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

raw ip: fix regression with multicast and RSVP

With 61f7427f02a raw sockets protosw has wildcard pr_protocol. Protocol
of a specific pcb is stored in inp_ip_p.

Reviewed by: karels
Reported by: karels
Differential revision: https://reviews.freebsd.org/D36429
Fixes: 61f7427f02a307d28af674a12c45dd546e3898e4


# 50fa27e7 09-Jul-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

netinet6: fix interface handling for loopback traffic

Currently, processing of IPv6 local traffic is partially broken:
link-local connection fails and global unicast connect() takes
3 seconds to complete.
This happens due to the combination of multiple factors.
IPv6 code passes original interface "origifp" when passing
traffic via loopack to retain the scope that is mandatory for the
correct hadling of link-local traffic. First problem is that the logic
of passing source interface is not working correcly for TCP connections,
resulting in passing "origifp" on the first 2 connection attempts and
lo0 on the subsequent ones. Second problem is that source address
validation logic skips its checks iff the source interface is loopback,
which doesn't cover "origifp" case.
More detailed description is available at https://reviews.freebsd.org/D35732

Fix the first problem by untangling&simplifying ifp/origifp logic.
Fix the second problem by switching source address validation check to
using M_LOOP mbuf flag instead of interface type.

PR: 265089
Reviewed by: ae, bz(previous version)
Differential Revision: https://reviews.freebsd.org/D35732
MFC after: 2 weeks


# 7d98cc09 01-Apr-2022 Andrey V. Elsukov <ae@FreeBSD.org>

Fix ipfw fwd that doesn't work in some cases

For IPv4 use dst pointer as destination address in fib4_lookup().
It keeps destination address from IPv4 header and can be changed
when PACKET_TAG_IPFORWARD tag was set by packet filter.

For IPv6 override destination address with address from dst_sa.sin6_addr,
that was set from PACKET_TAG_IPFORWARD tag.

Reviewed by: eugen
MFC after: 1 week
PR: 256828, 261697, 255705
Differential Revision: https://reviews.freebsd.org/D34732


# 9ba11796 27-Jan-2022 Andrew Gallatin <gallatin@FreeBSD.org>

Fix a memory leak when ip_output_send() returns EAGAIN due to send tag issues

When ip_output_send() returns EAGAIN due to issues with send tags (route
change, lagg failover, etc), it must free the mbuf. This is because
ip_output_send() was written as a wrapper/replacement for a direct
call to if_output(), and the contract with if_output() has
historically been that it owns the mbufs once called. When
ip_output_send() failed to free mbufs, it violated this assumption
and lead to leaked mbufs.

This was noticed when using NIC TLS in combination with hardware
rate-limited connections. When seeing lots of NIC output drops
triggered ratelimit send tag changes, we noticed we were leaking
ktls_sessions, send tags and mbufs. This was due ip_output_send()
leaking mbufs which held references to ktls_sessions, which in
turn held references to send tags.

Many thanks to jbh, rrs, hselasky and markj for their help in
debugging this.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34054
Reviewed by: hselasky, jhb, rrs
MFC after: 2 weeks


# 9f5432d5 15-Dec-2021 Kristof Provost <kp@FreeBSD.org>

netinet6: ip6_setpktopt() requires NET_EPOCH

ip6_setpktopt() can call ifnet_byindex() which requires epoch. Mark the
function as requiring NET_EPOCH, and ensure we enter it priot to calling
it.

Reported-by: syzbot+92526116441688fea8a3@syzkaller.appspotmail.com
Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D33462


# d74b7bae 04-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

ifnet_byindex() actually requires network epoch

Sweep over potentially unsafe calls to ifnet_byindex() and wrap them
in epoch. Most of the code touched remains unsafe, as the returned
pointer is being used after epoch exit. Mark that with a comment.

Validate the index argument inside the function, reducing argument
validation requirement from the callers and making V_if_index
private to if.c.

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D33263


# 44775b16 24-Nov-2021 Mark Johnston <markj@FreeBSD.org>

netinet: Remove unneeded mb_unmapped_to_ext() calls

in_cksum_skip() now handles unmapped mbufs on platforms where they're
permitted.

Reviewed by: glebius, jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33097


# a8d54fc9 09-Aug-2021 Michael Tuexen <tuexen@FreeBSD.org>

ipv6: Fix getsockopt() for some IPPROTO_IPV6 level socket options

Fix getsockopt() for the IPPROTO_IPV6 level socket options with the
following names: IPV6_HOPOPTS, IPV6_RTHDR, IPV6_RTHDRDSTOPTS,
IPV6_DSTOPTS, and IPV6_NEXTHOP.

Reviewed by: markj
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D31458


# 2290dfb4 19-May-2021 Ryan Stone <rstone@FreeBSD.org>

Enter the net epoch before calling ip6_setpktopts

ip6_setpktopts() can look up ifnets via ifnet_by_index(), which
is only safe in the net epoch. Ensure that callers are in the net
epoch before calling this function.

Sponsored by: Dell EMC Isilon
MFC after: 4 weeks
Reviewed by: donner, kp
Differential Revision: https://reviews.freebsd.org/D30630


# bb4a7d94 04-Mar-2021 Kristof Provost <kp@FreeBSD.org>

net: Introduce IPV6_DSCP(), IPV6_ECN() and IPV6_TRAFFIC_CLASS() macros

Introduce convenience macros to retrieve the DSCP, ECN or traffic class
bits from an IPv6 header.

Use them where appropriate.

Reviewed by: ae (previous version), rscheff, tuexen, rgrimes
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D29056


# 1bd44b11 14-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Do not reference returned ifa in in6_ifawithifp().

The only place where in6_ifawithifp() is used is ip6_output(),
which uses the returned ifa to bump traffic counters.
Given ifa stability guarantees is provided by epoch, do not refcount ifa.

This eliminates 2 atomic ops from IPv6 fast path.

Reviewed By: rstone
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28649


# 3f43ada9 28-Jan-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Catch up with 6edfd179c86: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.

Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address. Then, this was extended to array of
such pointers. Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change. The new name should be generic enough to avoid further
renaming.


# 0c325f53 18-Oct-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Implement flowid calculation for outbound connections to balance
connections over multiple paths.

Multipath routing relies on mbuf flowid data for both transit
and outbound traffic. Current code fills mbuf flowid from inp_flowid
for connection-oriented sockets. However, inp_flowid is currently
not calculated for outbound connections.

This change creates simple hashing functions and starts calculating hashes
for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the
system.

Reviewed by: glebius (previous version),ae
Differential Revision: https://reviews.freebsd.org/D26523


# 868aabb4 08-Oct-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow.

This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.

Note that for untagged traffic, explicitly adding a
priority will insert a special 801.1Q vlan header with
vlan ID = 0 to carry the priority setting

Reviewed by: gallatin, rrs
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26409


# b092fd6c 17-Sep-2020 Navdeep Parhar <np@FreeBSD.org>

if_vxlan(4): add support for hardware assisted checksumming, TSO, and RSS.

This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx
and rx), TSO, and RSS if the NIC is capable of performing these operations on
inner VXLAN traffic.

A VXLAN interface inherits the capabilities of its vxlandev interface if one is
specified or of the interface that hosts the vxlanlocal address. If other
interfaces will carry traffic for that VXLAN then they must have the same
hardware capabilities.

On transmit, if_vxlan verifies that the outbound interface has the required
capabilities and then translates the CSUM_ flags to their inner equivalents.
This tells the hardware ifnet that it needs to operate on the inner frame and
not the outer VXLAN headers.

An event is generated when a VXLAN ifnet starts. This allows hardware drivers to
configure their devices to expect VXLAN traffic on the specified incoming port.

On receive, the hardware does RSS and checksum verification on the inner frame.
if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is
not very clear why it didn't do this already.

Future work:
Rx: it should be possible to avoid the first trip up the protocol stack to get
the frame to if_vxlan just so it can decapsulate and requeue for a second trip
up the stack. The hardware NIC driver could directly call an if_vxlan receive
routine for VXLAN traffic instead.

Rx: LRO. depends on what happens with the previous item. There will have to to
be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.

Reviewed by: kib@
Relnotes: Yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25873


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# 19afc65a 30-Jul-2020 Mark Johnston <markj@FreeBSD.org>

ip6_output(): Check the return value of in6_getlinkifnet().

If the destination address has an embedded scope ID, make sure that it
corresponds to a valid ifnet before proceeding. Otherwise a sendto()
with a bogus link-local address can trigger a NULL pointer dereference.

Reported by: syzkaller
Reviewed by: ae
Fixes: r358572
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25887


# 95033af9 18-Jun-2020 Mark Johnston <markj@FreeBSD.org>

Add the SCTP_SUPPORT kernel option.

This is in preparation for enabling a loadable SCTP stack. Analogous to
IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured
in order to support a loadable SCTP implementation.

Discussed with: tuexen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 1483c1c5 28-May-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Replace ip6_ouput fib6_lookup_nh_<ext|basic> calls with fib6_lookup().

fib6_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.

Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.

Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24973


# d7452d89 12-May-2020 Andrew Gallatin <gallatin@FreeBSD.org>

IPv6: sync IP_NO_SND_TAG_RL support from IPv4

The IP_NO_SND_TAG_RL flag to ip{,6}_output() means that the packets
being sent should bypass hardware rate limiting. This is typically used
by modern TCP stacks for rexmits.

This support was added to IPv4 in r352657, but never added to IPv6, even
though rack and bbr call ip6_output() with this flag.

Reviewed by: rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24822


# 84af4cc1 11-May-2020 Andrew Gallatin <gallatin@FreeBSD.org>

Fix the build

Back out the IPv6 portion of r360903, as the stamp_tag param
is apparently not supported in upstream FreeBSD.

Sponsored by: Netflix
Pointy hat to: gallatin


# 6043ac20 11-May-2020 Andrew Gallatin <gallatin@FreeBSD.org>

Ktls: never skip stamping tags for NIC TLS

The newer RACK and BBR TCP stacks have added a mechanism
to disable hardware packet pacing for TCP retransmits.
This mechanism works by skipping the send-tag stamp
on rate-limited connections when the TCP stack calls
ip_output() with the IP_NO_SND_TAG_RL flag set.

When doing NIC TLS, we must ignore this flag, as
NIC TLS packets must always be stamped. Failure
to stamp a NIC TLS packet will result in crypto
issues.

Reviewed by: hselasky, rrs
Sponsored by: Netflix, Mellanox


# 7b6c99d0 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf
within m_epg namespace.
All edits except the 'struct mbuf' declaration and mb_dupcl() were done
mechanically with sed:

s/->m_ext_pgs.nrdy/->m_epg_nrdy/g
s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g
s/->m_ext_pgs.trail_len/->m_epg_trllen/g
s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g
s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g
s/->m_ext_pgs.flags/->m_epg_flags/g
s/->m_ext_pgs.record_type/->m_epg_record_type/g
s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g
s/->m_ext_pgs.tls/->m_epg_tls/g
s/->m_ext_pgs.so/->m_epg_so/g
s/->m_ext_pgs.seqno/->m_epg_seqno/g
s/->m_ext_pgs.stailq/->m_epg_stailq/g

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 983066f0 25-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert route caching to nexthop caching.

This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
to perform packet forwarding such as gateway interface and mtu. Nexthops
are shared among the routes, providing more pre-computed cache-efficient
data while requiring less memory. Splitting the LPM code and the attached
data solves multiple long-standing problems in the routing layer,
drastically reduces the coupling with outher parts of the stack and allows
to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
for notably better performance for large TCP senders. Caching works by
acquiring rtentry reference, which is protected by per-rtentry mutex.
If the routing table is changed (checked by comparing the rtable generation id)
or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
cached object, which is mostly mechanical. Other moving parts like cache
cleanup on rtable change remains the same.

Differential Revision: https://reviews.freebsd.org/D24340


# 23feb563 14-Apr-2020 Andrew Gallatin <gallatin@FreeBSD.org>

KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.

While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by: hselasky, jhb, rrs, glebius (early version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24213


# e02582d1 19-Mar-2020 Mark Johnston <markj@FreeBSD.org>

Fix synchronization in the IPV6_2292PKTOPTIONS set handler.

The inpcb needs to be locked when we update output packet options.
Otherwise it is possible for the IPV6_2292PKTOPTIONS handler to free
packet option structures while another thread is reading or updating
them.

Note that the option handler is still kind of broken. For instance it
frees all options before performing privilege checks for individual
options. However, this can be fixed separately.

Reported by: syzbot+52eb0fd4ddc119787f9d@syzkaller.appspotmail.com
Reviewed by: bz, tuexen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24125


# 8483fce6 03-Mar-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

ip6: retire in6_selectroute_fib() as promised 8 years ago

In r231852 I added in6_selectroute_fib() as a compat function with the
fibnum as an extra argument compared to in6_selectroute() to keep the
KPI stable.
Way too late retire this function again and add the fib to in6_selectroute()
which also only has a single consumer now and was an orphan function before.


# 000c42fa 03-Mar-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

ip6_output: use new routing KPI when not passed a cached route

Implement the equivalent of r347375 (IPv4) for the IPv6 output path.
In IPv6 we get passed a cached route (and inp) by udp6_output()
depending on whether we acquired a write lock on the INP.
In case we neither bind nor connect a first UDP packet would come in
with a cached route (wlocked) and all further packets would not.
In case we bind and do not connect we never write-lock the inp.

When we do not pass in a cached route, rather than providing the
storage for a route locally and pass it over the old lookup code
and down the stack, use the new route lookup KPI and acquire all
details we need to send the packet.

Compared to the IPv4 code the IPv6 code has a couple of possible
complications: given an option with a routing hdr/caching route there,
and path mtu (ro_pmtu) case which now equally has to deal with the
possibility of having a route which is NULL passed in, and the
fwd_tag in case a firewall changes the next hop (something to
factor out in the future).

Sponsored by: Netflix
Reviewed by: glebius
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D23886


# 3db60531 25-Feb-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

ip6_output: fix regression introduced in r358167 for ipv6 fragmentation

When moving the calculations for the optlen into the if (opt) block
which deals with possible extension headers I failed to initialise
unfragpartlen to the ipv6 header length if there were no extension
headers present. Correct that mistake to make IPv6 fragment length
calculcations work again.

Reported by: hselasky, kp
OKed by: hselasky, kp
MFC after: 3 days
X-MFC with: r358167
PR: 244393


# 3459050c 24-Feb-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix IPv6 checksums when exthdrs are present.

In two places in ip6_output we are doing (delayed) checksum calculations.
The initial logic came from SCTP in r205075,205104 and later I copied
and adjusted it for the TCP|UDP case in r235958.
The problem was that the original SCTP offsets were already wrong for any
case with extension headers present given IPv6 extension headers are not
part of the pseudo checksum calculations.
The later changes do not help in case there is checksum offloading as for
extension headers (incl. fragments) we do currrently never offload as we
have no infrastructure to know whether the NIC can handle these cases.

Correct the offsets for delayed checksum calculations and properly handle
mbuf flags. In addition harmonize the almost identical duplicate code.

While here eliminate the now unneeded variable hlen and add an always
missing mtod() call in the 1-b and 3 cases after the introduction of
the mb_unmapped_to_ext() calls.

Reported by: Francis Dupont (fdupont isc.org)
PR: 243675
MFC after: 6 days
Reviewed by: markj (earlier version), gallatin
Differential Revision: https://reviews.freebsd.org/D23760


# a1a6c01e 20-Feb-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

ip6_output: improve extension header handling

Move IPv6 source address checks from after extension header heandling
to the top of the function. If we do not pass these checks there is
no reason to do a lot of work upfront.

Fold extension header preparations and length calculations together into
a single branch and macro rather than doing them sequentially.
Likewise move extension header concatination into a single branch block
only doing it if we recorded any extension header length length.

Reviewed by: melifaro (earlier version), markj, gallatin
Sponsored by: Netflix (partially, originally)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23740


# 7c1daefe 18-Feb-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

ip6_output: update comments.

Clear up some comments and improve to panic messages.

No functional changes.

MFC after: 3 days


# b9555453 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make ip6_output() and ip_output() require network epoch.

All callers that before may called into these functions
without network epoch now must enter it.


# e00ee1a9 01-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

In r343631 error code for a packet blocked by a firewall was
changed from EACCES to EPERM. This change was not intentional,
so fix that. Return EACCESS if a firewall forbids sending.

Noticed by: ae


# 1e4f4e56 09-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

ip6_output() has a complex set of gotos, and some can jump out of
the epoch section towards return statement. Since entering epoch
is cheap, it is easier to cover the whole function with epoch,
rather than try to properly maintain its state.


# cb49ec54 07-Oct-2019 Mark Johnston <markj@FreeBSD.org>

Improve locking in the IPV6_V6ONLY socket option handler.

Acquire the inp lock before checking whether the socket is already bound,
and around updates to the inp_vflag field.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21867


# b8a6e03f 07-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Widen NET_EPOCH coverage.

When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111


# b2e60773 26-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277


# 0ecd976e 02-Aug-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

IPv6 cleanup: kernel

Finish what was started a few years ago and harmonize IPv6 and IPv4
kernel names. We are down to very few places now that it is feasible
to do the change for everything remaining with causing too much disturbance.

Remove "aliases" for IPv6 names which confusingly could indicate
that we are talking about a different data structure or field or
have two fields, one for each address family.
Try to follow common conventions used in FreeBSD.

* Rename sin6p to sin6 as that is how it is spelt in most places.
* Remove "aliases" (#defines) for:
- in6pcb which really is an inpcb and nothing separate
- sotoin6pcb which is sotoinpcb (as per above)
- in6p_sp which is inp_sp
- in6p_flowinfo which is inp_flow
* Try to use ia6 for in6_addr rather than in6p.
* With all these gone also rename the in6p variables to inp as
that is what we call it in most of the network stack including
parts of netinet6.

The reasons behind this cleanup are that we try to further
unify netinet and netinet6 code where possible and that people
will less ignore one or the other protocol family when doing
code changes as they may not have spotted places due to different
names for the same thing.

No functional changes.

Discussed with: tuexen (SCTP changes)
MFC after: 3 months
Sponsored by: Netflix


# 82334850 28-Jun-2019 John Baldwin <jhb@FreeBSD.org>

Add an external mbuf buffer type that holds multiple unmapped pages.

Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.

Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616


# 77a01441 11-Jun-2019 John Baldwin <jhb@FreeBSD.org>

Sort opt_foo.h #includes and add a missing blank line in ip_output().


# fb3bc596 24-May-2019 John Baldwin <jhb@FreeBSD.org>

Restructure mbuf send tags to provide stronger guarantees.

- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.

Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.

To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.

For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.

- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.

This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).

In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.

To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.

- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117


# c9d33708 10-May-2019 John Baldwin <jhb@FreeBSD.org>

Apply r280991 to ip6_fragment.

This uses m_dup_pkthdr() to copy all of the metadata about a packet to
each of its fragments including VLAN tags, mbuf tags, etc. instead of
hand-copying a few fields.

Reviewed by: bz
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117


# 50575ce1 25-Apr-2019 Andrew Gallatin <gallatin@FreeBSD.org>

Track TCP connection's NUMA domain in the inpcb

Drivers can now pass up numa domain information via the
mbuf numa domain field. This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.

Follow on changes will use this information for lacp egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.

Reviewed by: markj, kib, slavash, bz, scottl, jtl, tuexen
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20028


# 2f041b74 19-Apr-2019 Michael Tuexen <tuexen@FreeBSD.org>

Improve input validation for the socket option IPV6_CHECKSUM.

When using the IPPROTO_IPV6 level socket option IPV6_CHECKSUM on a raw
IPv6 socket, ensure that the value is either -1 or a non-negative even
number.

Reviewed by: bz@, thj@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D19966


# b252313f 31-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

New pfil(9) KPI together with newborn pfil API and control utility.

The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.

In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.

New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.

Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.

Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.

Differential Revision: https://reviews.freebsd.org/D18951


# ef0111fd 09-Jan-2019 Hans Petter Selasky <hselasky@FreeBSD.org>

Fix loopback traffic when using non-lo0 link local IPv6 addresses.

The loopback interface can only receive packets with a single scope ID,
namely the scope ID of the loopback interface itself. To mitigate this
packets which use the scope ID are appearing as received by the real
network interface, see "origifp" in the patch. The current code would
drop packets which are designated for loopback which use a link-local
scope ID in the destination address or source address, because they
won't match the lo0's scope ID. To fix this restore the network
interface pointer from the scope ID in the destination address for
the problematic cases. See comments added in patch for a more detailed
description.

This issue was introduced with route caching (ae@).

Reviewed by: bz (network)
Differential Revision: https://reviews.freebsd.org/D18769
MFC after: 1 week
Sponsored by: Mellanox Technologies


# cc426dd3 11-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

Remove unused argument to priv_check_cred.

Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by: The FreeBSD Foundation


# ec86402e 03-Sep-2018 Bjoern A. Zeeb <bz@FreeBSD.org>

Replicate r328271 from legacy IP to IPv6 using a single macro
to clear L2 and L3 route caches.
Also mark one function argument as __unused.

Reviewed by: karels, ae
Approved by: re (rgrimes)
Differential Revision: https://reviews.freebsd.org/D17007


# 56713d16 14-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

acquire inp lock around ip6_pcbopt to fix IPV6_TCLASS panic

Simple fix to address panics relating to setting IPV6_TCLASS
with setsockopt(). The premise of this change is that it is
ok to call malloc with M_NOWAIT while holding a lock on the
in6p.

If it later turns out that it is not ok, then major surgery
will be required, as ip6_setpktopt() will have to be fixed
(as it also calls malloc with M_NOWAIT) which pulls in the
ip6_pcbopts(), ip6_setpktopts(), ip6_setpktopt() call chain.

Submitted by: Jason Eggnet
Reviewed by: rrs, transport, sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16201


# 1a43cff9 06-Jun-2018 Sean Bruno <sbruno@FreeBSD.org>

Load balance sockets with new SO_REUSEPORT_LB option.

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations:
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).

This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@
for the substantive feedback that is included in this commit.

Submitted by: Johannes Lundberg <johalun0@gmail.com>
Obtained from: DragonflyBSD
Relnotes: Yes
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003


# 4a089e6b 06-Jun-2018 Andrey V. Elsukov <ae@FreeBSD.org>

Use m_copyback() function to write delayed checksum when it isn't located
in the first mbuf of the chain.

MFC after: 1 week


# 7875017c 24-Apr-2018 Sean Bruno <sbruno@FreeBSD.org>

Revert r332894 at the request of the submitter.

Submitted by: Johannes Lundberg <johalun0_gmail.com>
Sponsored by: Limelight Networks


# 7b7796ee 23-Apr-2018 Sean Bruno <sbruno@FreeBSD.org>

Load balance sockets with new SO_REUSEPORT_LB option

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).

Submitted by: Johannes Lundberg <johanlun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003


# c187c034 23-Mar-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Remove some unneccessary variable sets in IPv6 code, as detected by
clang's static analyzer.

Reviewed by: bz
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D10940


# 72bfa0bf 23-Mar-2018 Sean Bruno <sbruno@FreeBSD.org>

Revert r331379 as the "simple" lock changes have revealed a deeper problem
and need for a rethink.

Submitted by: Jason Eggleston <jason@eggnet.com>
Sponsored by: Limelight Networks


# effaab88 23-Mar-2018 Kristof Provost <kp@FreeBSD.org>

netpfil: Introduce PFIL_FWD flag

Forwarded packets passed through PFIL_OUT, which made it difficult for
firewalls to figure out if they were forwarding or producing packets. This in
turn is an issue for pf for IPv6 fragment handling: it needs to call
ip6_output() or ip6_forward() to handle the fragments. Figuring out which was
difficult (and until now, incorrect).
Having pfil distinguish the two removes an ugly piece of code from pf.

Introduce a new variant of the netpfil callbacks with a flags variable, which
has PFIL_FWD set for forwarded packets. This allows pf to reliably work out if
a packet is forwarded.

Reviewed by: ae, kevans
Differential Revision: https://reviews.freebsd.org/D13715


# 06b479a6 22-Mar-2018 Sean Bruno <sbruno@FreeBSD.org>

Refactor ip6_getpcbopt() for better locking and memory management

Created GET_PKTOPT_EXT_HDR() and GET_PKTOPT_SOCKADDR() macros to
handle safely fetching options from in6p_outputopts, including
properly dealing with in6p locking and preparing memory for
sooptcopyout().

Changed the function signature of ip6_getpcbopt() to allow the
function to acquire and release locks on in6p as needed.

Submitted by: Jason Eggleston <jason@eggnet.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14619


# 2a499acf 22-Mar-2018 Sean Bruno <sbruno@FreeBSD.org>

Simple locking fixes in ip_ctloutput, ip6_ctloutput, rip_ctloutput.

Submitted by: Jason Eggleston <jason@eggnet.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14624


# 5cbeca44 22-Mar-2018 Sean Bruno <sbruno@FreeBSD.org>

Handle locking and memory safety for IPV6_PATHMTU in ip6_ctloutput().

Submitted by: Jason Eggleston <jason@eggnet.com>
Reviewed by: ae
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14622


# 37d4fc1e 22-Mar-2018 Sean Bruno <sbruno@FreeBSD.org>

Improve write locking in ip6_ctloutput() with macros.

Submitted by: Jason Eggleston <jason@eggnet.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14620


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# ae69ad88 27-Jul-2017 Bjoern A. Zeeb <bz@FreeBSD.org>

After inpcb route caching was put back in place there is no need for
flowtable anymore (as flowtable was never considered to be useful in
the forwarding path).

Reviewed by: np
Differential Revision: https://reviews.freebsd.org/D11448


# 8b07e00e 30-May-2017 Jonathan T. Looney <jtl@FreeBSD.org>

Fix an unnecessary/incorrect check in the PKTOPT_EXTHDRCPY macro.

This macro allocates memory and, if malloc does not return NULL, copies
data into the new memory. However, it doesn't just check whether malloc
returns NULL. It also checks whether we called malloc with M_NOWAIT. That
is not necessary.

While it may be that malloc() will only return NULL when the M_NOWAIT flag
is set, we don't need to check for this when checking malloc's return
value. Further, in this case, the check was not completely accurate,
because it checked for flags == M_NOWAIT, rather than treating it as a bit
field and checking for (flags & M_NOWAIT).

Reviewed by: ae
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D10942


# ce9ac139 09-May-2017 Navdeep Parhar <np@FreeBSD.org>

ip6_output runs with the inp lock held, just like ip_output.


# d78c0804 22-Apr-2017 Kristof Provost <kp@FreeBSD.org>

Rename variable for clarity

Rename the mtu variable in ip6_fragment(), because mtu is misleading. The
variable actually holds the fragment length.
No functional change.

Suggested by: ae


# 00eab743 20-Apr-2017 Kristof Provost <kp@FreeBSD.org>

pf: Fix possible incorrect IPv6 fragmentation

When forwarding pf tracks the size of the largest fragment in a fragmented
packet, and refragments based on this size.
It failed to ensure that this size was a multiple of 8 (as is required for all
but the last fragment), so it could end up generating incorrect fragments.

For example, if we received an 8 byte and 12 byte fragment pf would emit a first
fragment with 12 bytes of payload and the final fragment would claim to be at
offset 8 (not 12).

We now assert that the fragment size is a multiple of 8 in ip6_fragment(), so
other users won't make the same mistake.

Reported by: Antonios Atlasis <aatlasis at secfu net>
MFC after: 3 days


# 8c1960d5 25-Mar-2017 Mike Karels <karels@FreeBSD.org>

Fix reference count leak with L2 caching.

ip_forward, TCP/IPv6, and probably SCTP leaked references to L2 cache
entry because they used their own routes on the stack, not in_pcb routes.
The original model for route caching was callers that provided a route
structure to ip{,6}input() would keep the route, and this model was used
for L2 caching as well. Instead, change L2 caching to be done by default
only when using a route structure in the in_pcb; the pcb deallocation
code frees L2 as well as L3 cacches. A separate change will add route
caching to TCP/IPv6.

Another suggestion was to have the transport protocols indicate willingness
to use L2 caching, but this approach keeps the changes in the network
level

Reviewed by: ae gnn
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D10059


# dce33a45 05-Mar-2017 Ermal Luçi <eri@FreeBSD.org>

The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Reviewed by: adrian, aw
Approved by: ae (mentor)
Sponsored by: rsync.net
Differential Revision: D9235


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# c10c5b1e 11-Feb-2017 Ermal Luçi <eri@FreeBSD.org>

Committed without approval from mentor.

Reported by: gnn


# ed55edce 09-Feb-2017 Ermal Luçi <eri@FreeBSD.org>

The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Sponsored-by: rsync.net
Differential Revision: D9235
Reviewed-by: adrian


# fcf59617 06-Feb-2017 Andrey V. Elsukov <ae@FreeBSD.org>

Merge projects/ipsec into head/.

Small summary
-------------

o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
option IPSEC_SUPPORT added. It enables support for loading
and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
support was removed. Added TCP/UDP checksum handling for
inbound packets that were decapsulated by transport mode SAs.
setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
build as part of ipsec.ko module (or with IPSEC kernel).
It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
methods. The only one header file <netipsec/ipsec_support.h>
should be included to declare all the needed things to work
with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
- now all security associations stored in the single SPI namespace,
and all SAs MUST have unique SPI.
- several hash tables added to speed up lookups in SADB.
- SADB now uses rmlock to protect access, and concurrent threads
can do SA lookups in the same time.
- many PF_KEY message handlers were reworked to reflect changes
in SADB.
- SADB_UPDATE message was extended to support new PF_KEY headers:
SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
avoid locking protection for ipsecrequest. Now we support
only limited number (4) of bundled SAs, but they are supported
for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
check for full history of applied IPsec transforms.
o References counting rules for security policies and security
associations were changed. The proper SA locking added into xform
code.
o xform code was also changed. Now it is possible to unregister xforms.
tdb_xxx structures were changed and renamed to reflect changes in
SADB/SPDB, and changed rules for locking and refcounting.

Reviewed by: gnn, wblock
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9352


# f3e7afe2 18-Jan-2017 Hans Petter Selasky <hselasky@FreeBSD.org>

Implement kernel support for hardware rate limited sockets.

- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months


# aec9c8d5 17-Oct-2016 George V. Neville-Neil <gnn@FreeBSD.org>

Limit the number of mbufs that can be allocated for IPV6_2292PKTOPTIONS
(and IPV6_PKTOPTIONS).

PR: 100219
Submitted by: Joseph Kong
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D5157


# cc94f0c2 13-Oct-2016 Gleb Smirnoff <glebius@FreeBSD.org>

- Revert r300854, r303657 which tried to fix regression from r297225.
- Fix the regression proper way using RO_RTFREE().

Submitted by: ae


# c3bef61e 15-Sep-2016 Kevin Lo <kevlo@FreeBSD.org>

Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead.

Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D7878


# 0f5687f2 23-Aug-2016 Mike Karels <karels@FreeBSD.org>

Fix L2 caching for UDP over IPv6

ip6_output() was missing cache invalidation code analougous to
ip_output.c. r304545 disabled L2 caching for UDP/IPv6 as a workaround.
This change adds the missing cache invalidation code and reverts
r304545.

Reviewed by: gnn
Approved by: gnn (mentor)
Tested by: peter@, Mike Andrews
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D7591


# 723758b7 01-Aug-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Fix NULL pointer dereference.
ro pointer can be NULL when IPSec consumes mbuf.

PR: 211486
MFC after: 3 days


# d4c22202 01-Aug-2016 Andrew Gallatin <gallatin@FreeBSD.org>

Rework IPV6 TCP path MTU discovery to match IPv4

- Re-write tcp_ctlinput6() to closely mimic the IPv4 tcp_ctlinput()

- Now that tcp_ctlinput6() updates t_maxseg, we can allow ip6_output()
to send TCP packets without looking at the tcp host cache for every
single transmit.

- Make the icmp6 code mimic the IPv4 code & avoid returning
PRC_HOSTDEAD because it is so expensive.

Without these changes in place, every TCP6 pmtu discovery or host
unreachable ICMP resulted in a call to in6_pcbnotify() which walks the
tcbinfo table with the write lock held. Because the tcbinfo table is
shared between IPv4 and IPv6, this causes huge scalabilty issues on
servers with lots of (~100K) TCP connections, to the point where even
a small percent of IPv6 traffic had a disproportionate impact on
overall throughput.

Reviewed by: bz, rrs, ae (all earlier versions), lstewart (in Netflix's tree)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D7272


# be393491 13-Jul-2016 Dimitry Andric <dim@FreeBSD.org>

Fix a page fault in ip6_setpktopt(), occurring when the pflog module is
loaded, and syncthing is started, which uses setsockopt(IPV6_PKGINFO).

This is because pflog interfaces do not normally have an IPv6 address,
causing the ND_IFINFO() macro to dereference a NULL pointer.

Reviewed by: ae
PR: 210943
MFC after: 3 days


# 4c105402 08-Jun-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Cleanup unneded include "opt_ipfw.h".

It was used for conditional build IPFIREWALL_FORWARD support.
But IPFIREWALL_FORWARD option was removed a long time ago.


# 6d768226 02-Jun-2016 George V. Neville-Neil <gnn@FreeBSD.org>

This change re-adds L2 caching for TCP and UDP, as originally added in D4306
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.

Submitted by: Mike Karels
Differential Revision: https://reviews.freebsd.org/D6262


# 6351b385 27-May-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Plug route reference underleak that happens with FLOWTABLE after r297225.

Submitted by: Mike Karels <mike karels.net>


# 1d4d43c0 19-May-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Remove ip6 adjusting from the place where pointer couldn't be changed.
And add comment after calling PFIL hooks, where it could be changed.


# 9fdab4c0 19-May-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Remove ip6 pointer initialization and strange check from the beginning
of ip6_output(). It isn't used until the first time adjusted.
Remove the comment about adjusting where it is actually initialized.


# 5e0a6f31 19-May-2016 Mark Johnston <markj@FreeBSD.org>

Move IPv6 malloc tag definitions into the IPv6 code.


# f0937b2c 18-May-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Since PFIL can change destination address, use its always actual value
from mbuf when calculating path mtu. Remove now unused finaldst variable.
Also constify dst argument in ip6_getpmtu() and ip6_getpmtu_ctl().

Reviewed by: melifaro
Obtained from: Yandex LLC
Sponsored by: Yandex LLC


# 4ee7e5a6 17-May-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Call RO_RTFREE() when we have detected the change of destination
address, otherwise the old route will be used with new destination.

MFC after: 1 week


# 155d72c4 15-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/net* : for pointers replace 0 with NULL.

Mostly cosmetical, no functional change.

Found with devel/coccinelle.


# 84cc0778 24-Mar-2016 George V. Neville-Neil <gnn@FreeBSD.org>

FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.

Submitted by: Mike Karels
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4306


# 56a5f52e 29-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

New way to manage reference counting of mbuf external storage.

The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount
value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used.
The first mbuf to attach a cluster stores the refcount. The further mbufs
to reference the cluster point at refcount in the first mbuf. The first
mbuf is freed only when the last reference is freed.

The benefit over refcounts stored in separate slabs is that now refcounts
of different, unrelated mbufs do not share a cache line.

For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd()
becomes void, making widely used M_EXTADD macro safe.

For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization
exactly against the cache aliasing problem with regular refcounting.

Discussed with: rrs, rwatson, gnn, hiren, sbruno, np
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D5396
Sponsored by: Netflix


# bacf6684 04-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Finish r293098: make ip6_getpmtu() and ip6_getpmtu_ctl() use new routing API


# 0d4df029 03-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Handle IPV6_PATHMTU option by spliting ip6_getpmtu_ctl() from ip6_getpmtu().
Add ro_mtu field to 'struct route' to be able to pass lookup MTU back to
the caller.

Currently, ip6_getpmtu() has 2 totally different use cases:
1) control plane (IPV6_PATHMTU req), where we just need to calculate MTU
and return it, w/o any reusability.
2) Actual ip6_output() data path where we (nearly) always use the provided
route lookup data. If this data is not 'valid' we need to perform another
lookup and save the result (which cannot be re-used by ip6_output()).

Given that, handle 1) by calling separate function doing rte lookup itself.
Resulting MTU is calculated by (newly-added) ip6_calcmtu() used by both
ip6_getpmtu_ctl() and ip6_getpmtu().
For 2) instead of storing ref'ed rte, store mtu (the only needed data
from the lookup result) inside newly-added ro_mtu field.
'struct route' was shrinked by 8(or 4 bytes) in r292978. Grow it again
by 4 bytes. New ro_mtu field will be used in other places like
ip/tcp_output (EMSGSIZE handling from output routines).

Reviewed by: ae


# 912568c8 30-Dec-2015 Jonathan T. Looney <jtl@FreeBSD.org>

Add the appropriate case statement for IPV6_BINDMULTI so the option can be
retrieved with getsockopt().

CID: 1229928
Differential Revision: https://reviews.freebsd.org/D4737
Reviewed by: adrian
Sponsored by: Juniper Networks


# 637670e7 15-Nov-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Bring back the ability of passing cached route via nd6_output_ifp().


# 1fe201c3 16-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify the way of attaching IPv6 link-layer header.

Problem description:
How do we currently perform layer 2 resolution and header imposition:

For IPv4 we have the following chain:
ip_output() -> (ether|atm|whatever)_output() -> arpresolve()

Lookup is done in proper place (link-layer output routine) and it is possible
to provide cached lle data.

For IPv6 situation is more complex:
ip6_output() -> nd6_output() -> nd6_output_ifp() -> (whatever)_output() ->
nd6_storelladdr()

We have ip6_ouput() which calls nd6_output() instead of link output routine.
nd6_output() does the following:
* checks if lle exists, creates it if needed (similar to arpresolve())
* performes lle state transitions (similar to arpresolve())
* calls nd6_output_ifp() which pushes packets to link output routine along
with running SeND/MAC hooks regardless of lle state
(e.g. works as run-hooks placeholder).

After that, iface output routine like ether_output() calls nd6_storelladdr()
which performs lle lookup once again.

As a result, we perform lookup twice for each outgoing packet for most types
of interfaces. We also need to maintain runtime-checked table of 'nd6-free'
interfaces (see nd6_need_cache()).

Fix this behavior by eliminating first ND lookup. To be more specific:
* make all nd6_output() consumers use nd6_output_ifp() instead
* rename nd6_output[_slow]() to nd6_resolve_[slow]()
* convert nd6_resolve() and nd6_resolve_slow() to arpresolve() semantics,
e.g. copy L2 address to buffer instead of pushing packet towards lower
layers
* Make all nd6_storelladdr() users use nd6_resolve()
* eliminate nd6_storelladdr()

The resulting callchain is the following:
ip6_output() -> nd6_output_ifp() -> (whatever)_output() -> nd6_resolve()

Error handling:
Currently sending packet to non-existing la results in ip6_<output|forward>
-> nd6_output() -> nd6_output _lle() which returns 0.
In new scenario packet is propagated to <ether|whatever>_output() ->
nd6_resolve() which will return EWOULDBLOCK, and that result
will be converted to 0.

(And EWOULDBLOCK is actually used by IB/TOE code).

Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D1469


# 68bb8d62 06-Sep-2015 Adrian Chadd <adrian@FreeBSD.org>

Add support for receiving flowtype, flowid and RSS bucket information as part of recvmsg().

Submitted by: Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision: https://reviews.freebsd.org/D3562


# 331dff07 08-Aug-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify ip[6] simploop:
Do not pass 'dst' sockaddr to ip[6]_mloopback:
- We have explicit check for AF_INET in ip_output()
- We assume ip header inside passed mbuf in ip_mloopback
- We assume ip6 header inside passed mbuf in ip6_mloopback


# cb207f93 03-Jul-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Keep IPv6 address specified by IPV6_PKTINFO socket option in kernel
internal form to be able handle link-local IPv6 addresses.

Reported by: kp
Tested by: kp


# 654bdb5a 07-May-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Mark data checksum as valid for multicast packets, that we send back
to myself via simloop.
Also remove duplicate check under #ifdef DIAGNOSTIC.

PR: 180065
MFC after: 1 week


# 79831849 31-Mar-2015 Kristof Provost <kp@FreeBSD.org>

Preserve IPv6 fragment IDs accross reassembly and refragmentation

When forwarding fragmented IPv6 packets and filtering with PF we
reassemble and refragment. That means we generate new fragment headers
and a new fragment ID.

We already save the fragment IDs so we can do the reassembly so it's
straightforward to apply the incoming fragment ID on the refragmented
packets.

Differential Revision: https://reviews.freebsd.org/D2188
Approved by: gnn (mentor)


# 8f1beb88 04-Mar-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Fix deadlock in IPv6 PCB code.

When several threads are trying to send datagram to the same destination,
but fragmentation is disabled and datagram size exceeds link MTU,
ip6_output() calls pfctlinput2(PRC_MSGSIZE). It does notify all
sockets wanted to know MTU to this destination. And since all threads
hold PCB lock while sending, taking the lock for each PCB in the
in6_pcbnotify() leads to deadlock.

RFC 3542 p.11.3 suggests notify all application wanted to receive
IPV6_PATHMTU ancillary data for each ICMPv6 packet too big message.
But it doesn't require this, when we don't receive ICMPv6 message.

Change ip6_notify_pmtu() function to be able use it directly from
ip6_output() to notify only one socket, and to notify all sockets
when ICMPv6 packet too big message received.

PR: 197059
Differential Revision: https://reviews.freebsd.org/D1949
Reviewed by: no objection from #network
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC


# 6c269f69 15-Feb-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Factor out ip6_fragment() function, to be used in IPv6 stack and pf(4).

Submitted by: Kristof Provost
Differential Revision: D1766


# e5ee7060 15-Feb-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Move ip6_deletefraghdr() to frag6.c.

Suggested by: bz


# 0b438b0f 15-Feb-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Factor out ip6_deletefraghdr() function, to be shared between IPv6
stack and pf(4).

Submitted by: Kristof Provost
Reviewed by: ae
Differential Revision: D1764


# b2bdc62a 18-Jan-2015 Adrian Chadd <adrian@FreeBSD.org>

Refactor / restructure the RSS code into generic, IPv4 and IPv6 specific
bits.

The motivation here is to eventually teach netisr and potentially
other networking subsystems a bit more about how RSS work queues / buckets
are configured so things have a hope of auto-configuring in the future.

* net/rss_config.[ch] takes care of the generic bits for doing
configuration, hash function selection, etc;
* topelitz.[ch] is now in net/ rather than netinet/;
* (and would be in libkern if it didn't directly include RSS_KEYSIZE;
that's a later thing to fix up.)
* netinet/in_rss.[ch] now just contains the IPv4 specific methods;
* and netinet/in6_rss.[ch] now just contains the IPv6 specific methods.

This should have no functional impact on anyone currently using
the RSS support.

Differential Revision: D1383
Reviewed by: gnn, jfv (intel driver bits)


# ffec6ee5 12-Jan-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Do not go one layer down to check ifqueue length. First, not all drivers
use ifqueue at all. Second, there is no point in this lockless check.
Either positive or negative result of the check could be incorrect after
a tick.

Sponsored by: Nginx, Inc.


# ed6a66ca 05-Jan-2015 Robert Watson <rwatson@FreeBSD.org>

To ease changes to underlying mbuf structure and the mbuf allocator, reduce
the knowledge of mbuf layout, and in particular constants such as M_EXT,
MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment
utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single
M_ALIGN() macro, implemented by a now-inlined m_align() function:

- Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline.
- Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align().
- Update consumers around the tree to simply use M_ALIGN().

This change eliminates a number of cases where mbuf consumers must be aware
of whether or not mbufs returned by the allocator use external storage, but
also assumptions about the size of the returned mbuf. This will make it
easier to introduce changes in how we use external storage, as well as
features such as variable-size mbufs.

Differential Revision: https://reviews.freebsd.org/D1436
Reviewed by: glebius, trasz, gnn, bz
Sponsored by: EMC / Isilon Storage Division


# 0275b2e3 11-Dec-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Remove flag/flags argument from the following functions:
ipsec_getpolicybyaddr()
ipsec4_checkpolicy()
ip_ipsec_output()
ip6_ipsec_output()

The only flag used here was IP_FORWARDING.

Obtained from: Yandex LLC
Sponsored by: Yandex LLC


# c2529042 01-Dec-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after: 1 month
Sponsored by: Mellanox Technologies


# f9723c77 20-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify API: use new NHOP_LOOKUP_AIFP flag to select what ifp
we need to return.
Rename fib[64]_lookup_nh_basic to fib[64]_lookup_nh, add flags
fields for all relevant functions.


# 7f948f12 16-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Finish r274175: do control plane MTU tracking.

Update route MTU in case of ifnet MTU change.
Add new RTF_FIXEDMTU to track explicitly specified MTU.

Old behavior:
ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU.
User has to manually update all routes.
ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU.
However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu
gets updated.

New behavior:
ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs gets updated
with new MTU unless RTF_FIXEDMTU flag set on them.
ifconfig em0 mtu 9000->1500 -> all routes in all fibs gets updated with new
MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu.

route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag.
route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag.

PR: 194238
MFC after: 1 month
CR: D1125


# 603eaf79 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Renove faith(4) and faithd(8) from base. It looks like industry
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.

No objections from: net@


# 257480b8 04-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert netinet6/ to use new routing API.

* Remove &ifpp from ip6_output() in favor of ri->ri_nh_info
* Provide different wrappers to in6_selectsrc:
Currently it is used by 2 differenct type of customers:
- socket-based one, which all are unsure about provided
address scope and
- in-kernel ones (ND code mostly), which don't have
any sockets, options, crededentials, etc.
So, we provide two different wrappers to in6_selectsrc()
returning select source.
* Make different versions of selectroute():
Currenly selectroute() is used in two scenarios:
- SAS, via in6_selecsrc() -> in6_selectif() -> selectroute()
- output, via in6_output -> wrapper -> selectroute()
Provide different versions for each customer:
- fib6_lookup_nh_basic()-based in6_selectif() which is
capable of returning interface only, without MTU/NHOP/L2
calculations
- full-blown fib6_selectroute() with cached route/multipath/
MTU/L2
* Stop using routing table for link-local address lookups
* Add in6_ifawithifp_lla() to make for-us check faster for link-local
* Add in6_splitscope / in6_setllascope for faster embed/deembed scopes


# f0cace5d 12-Oct-2014 Robert Watson <rwatson@FreeBSD.org>

When deciding whether to call m_pullup() even though there is adequate
data in an mbuf, use M_WRITABLE() instead of a direct test of M_EXT;
the latter both unnecessarily exposes mbuf-allocator internals in the
protocol stack and is also insufficient to catch all cases of
non-writability.

(NB: m_pullup() does not actually guarantee that a writable mbuf is
returned, so further refinement of all of these code paths continues to
be required.)

Reviewed by: bz
MFC after: 3 days
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D900


# 9c57a5b6 01-Oct-2014 Hiroki Sato <hrs@FreeBSD.org>

Add an additional routing table lookup when m->m_pkthdr.fibnum is changed
at a PFIL hook in ip{,6}_output(). IPFW setfib rule did not perform
a routing table lookup when the destination address was not changed.

CR: D805


# 9196891f 10-Sep-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Add additional checks for IPV6_PKTINFO handling (RFC 3542):

* Return ENETDOWN when interface specified by ipi6_ifindex is not
enabled for IPv6 use.
* Return EADDRNOTAVAIL when ipi6_ifindex specifies an interface, but the
address ipi6_addr is not available for use on that interface.
* Return EINVAL when ipi6_addr is multicast address.

Obtained from: Yandex LLC
Sponsored by: Yandex LLC


# b174de32 08-Sep-2014 Adrian Chadd <adrian@FreeBSD.org>

Add IP_NODEFAULTFLOWID awareness to ip6_output().

Differential Revision: https://reviews.freebsd.org/D527


# c7c0d948 11-Jul-2014 Adrian Chadd <adrian@FreeBSD.org>

Add IPv6 flowid, bindmulti and RSS awareness.


# aaf2cfc0 27-May-2014 VANHULLEBUS Yvan <vanhu@FreeBSD.org>

Fixed IPv4-in-IPv6 and IPv6-in-IPv4 IPsec tunnels.
For IPv6-in-IPv4, you may need to do the following command
on the tunnel interface if it is configured as IPv4 only:
ifconfig <interface> inet6 -ifdisabled

Code logic inspired from NetBSD.

PR: kern/169438
Submitted by: emeric.poupon@netasq.com
Reviewed by: fabient, ae
Obtained from: NETASQ


# e3a7aa6f 04-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with: melifaro
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 0ff96b4f 17-Feb-2014 Gleb Smirnoff <glebius@FreeBSD.org>

o Remove at compile time the HASH_ALL code, that was never
tested and is unfinished. However, I've tested my version,
it works okay. As before it is unfinished: timeout aren't
driven by TCP session state. To enable the HASH_ALL mode,
one needs in kernel config:

options FLOWTABLE_HASH_ALL

o Reduce the alignment on flentry to 64 bytes. Without
the FLOWTABLE_HASH_ALL option, twice less memory would
be consumed by flows.
o API to ip_output()/ip6_output() got even more thin: 1 liner.
o Remove unused unions. Simply use fle->f_key[].
o Merge all IPv4 code into flowtable_lookup_ipv4(), and do same
flowtable_lookup_ipv6(). Stop copying data to on stack
sockaddr structures, simply use key[] on stack.
o Move code from flowtable_lookup_common() that actually works
on insertion into flowtable_insert().

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 5d6d7e75 07-Feb-2014 Gleb Smirnoff <glebius@FreeBSD.org>

o Revamp API between flowtable and netinet, netinet6.
- ip_output() and ip_output6() simply call flowtable_lookup(),
passing mbuf and address family. That's the only code under
#ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
- Remove hand made pcpu stats, and utilize counter(9).
- Snapshot of statistics is available via 'netstat -rs'.
- All sysctls are moved into net.flowtable namespace, since
spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
- Remove chain of multiple flowtables. We simply have one for
IPv4 and one for IPv6.
- Flowtables are allocated in flowtable.c, symbols are static.
- With proper argument to SYSINIT() we no longer need flowtable_ready.
- Hash salt doesn't need to be per-VNET.
- Removed rudimentary debugging, which use quite useless in dtrace era.

The runtime behavior of flowtable shouldn't be changed by this commit.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 76039bc8 26-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 7caf4ab7 15-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Utilize counter(9) to accumulate statistics on interface addresses. Add
four counters to struct ifaddr. This kills '+=' on a variables shared
between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
took if_data not from the ifaddr, but from ifaddr's ifnet. [2]

Submitted by: melifaro [1], pluknet[2]
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 3fa98cf9 15-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Remove unsigned < 0 check.


# 1b4381af 24-Aug-2013 Andre Oppermann <andre@FreeBSD.org>

Restructure the mbuf pkthdr to make it fit for upcoming capabilities and
features. The changes in particular are:

o Remove rarely used "header" pointer and replace it with a 64bit protocol/
layer specific union PH_loc for local use. Protocols can flexibly overlay
their own 8 to 64 bit fields to store information while the packet is
worked on.

o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
instead of pkthdr.header.

o Extend csum_flags to 64bits to allow for additional future offload
information to be carried (e.g. iSCSI, IPsec offload, and others).

o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
rsstype field. Adjust accessor macros.

o Add cosqos field to store Class of Service / Quality of Service information
with the packet. It is not yet supported in any drivers but allows us to
get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
a modernized ALTQ.

o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
from the start of the packet. This is important for various offload
capabilities and to relieve the drivers from having to parse the packet
and protocol headers to find out location of checksums and other
information. Header parsing in drivers is a lot of copy-paste and
unhandled corner cases which we want to avoid.

o Add another flexible 64bit union to map various additional persistent
packet information, like ether_vtag, tso_segsz and csum fields.
Depending on the csum_flags settings some fields may have different usage
making it very flexible and adaptable to future capabilities.

o Restructure the CSUM flags to better signify their outbound (down the
stack) and inbound (up the stack) use. The CSUM flags used to be a bit
chaotic and rather poorly documented leading to incorrect use in many
places. Bring clarity into their use through better naming.
Compatibility mappings are provided to preserve the API. The drivers
can be corrected one by one and MFC'd without issue.

o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).

Sponsored by: The FreeBSD Foundation


# efdf104b 04-Jul-2013 Mikolaj Golub <trociny@FreeBSD.org>

In r227207, to fix the issue with possible NULL inp_socket pointer
dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR
for multicast), INP_REUSEPORT flag was introduced to cache the socket
option. It was decided then that one flag would be enough to cache
both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR
setsockopt(2), it was checked if it was called for a multicast address
and INP_REUSEPORT was set accordingly.

Unfortunately that approach does not work when setsockopt(2) is called
before binding to a multicast address: the multicast check fails and
INP_REUSEPORT is not set.

Fix this by adding INP_REUSEADDR flag to unconditionally cache
SO_REUSEADDR.

PR: 179901
Submitted by: Michael Gmelin freebsd grem.de (initial version)
Reviewed by: rwatson
MFC after: 1 week


# 4871fc4a 16-May-2013 Julian Elischer <julian@FreeBSD.org>

Finally change the mbuf to have its own fib field instead of stealing
4 flag bits. This was supposed to happen in 8.0, and again in 2012..

MFC after: never


# 9cb8d207 09-Apr-2013 Andrey V. Elsukov <ae@FreeBSD.org>

Use IP6STAT_INC/IP6STAT_DEC macros to update ip6 stats.

MFC after: 1 week


# 10e5acc3 15-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Use m_getcl() instead of hand allocating.
- Do not calculate constant length values at run time,
CTASSERT() their sanity.
- Remove superfluous cleaning of mbuf fields after allocation.
- Replace compat macros with function calls.

Sponsored by: Nginx, Inc.


# 7b07d1be 14-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Use m_getcl() instead of hand allocating.
- Use m_get()/m_gethdr() instead of macros.
- Remove superfluous cleaning of mbuf fields after allocation.

Sponsored by: Nginx, Inc.


# f8fe3dc9 19-Dec-2012 Andrey V. Elsukov <ae@FreeBSD.org>

When we have some address to forward (e.g. it was specified with ipfw fwd),
we should pass it as first argument into in6_selectroute_fib function to
initiate new route lookup.

MFC after: 1 week


# 16607317 19-Dec-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Make dst_sa initialization only when it is actually needed.

MFC after: 1 week


# 61d88f34 19-Dec-2012 Andrey V. Elsukov <ae@FreeBSD.org>

The selectroute functions does own account of EHOSTUNREACH errors,
no need to do it twice.

MFC after: 1 week


# eb1b1807 05-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# ffdbf9da 01-Nov-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by: andre


# c1de64a4 25-Oct-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks


# 6f56329a 22-Oct-2012 Xin LI <delphij@FreeBSD.org>

Remove __P.

Submitted by: kevlo
Reviewed by: md5(1)
MFC after: 2 months


# ab16a5bd 19-Aug-2012 Mikolaj Golub <trociny@FreeBSD.org>

In ip6_ctloutput() guard inp_flags modifications with INP_WLOCK.

MFC after: 2 weeks


# 3b43b783 31-Jul-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

In case of IPsec he have to do delayed checksum calculations before
adding any extension header, or rather before calling into IPsec
processing as we may send the packet and not return to IPv6 output
processing here.

PR: kern/170116
MFC After: 3 days


# 68c99a60 30-Jul-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Improve the should-never-hit printf to ease debugging in case we'd ever hit
it again when doing the delayed IPv6 checksum calculations.

MFC after: 3 days


# 5dbbe4fd 28-Jul-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

For consistency put the IPsec comment iside the #fidef section.

MFC after: 3 days


# bf984051 04-Jul-2012 Gleb Smirnoff <glebius@FreeBSD.org>

When ip_output()/ip6_output() is supplied a struct route *ro argument,
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.

The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.

- Introduce new flag for struct route/route_in6, that marks route
not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
but empty route should use RO_RTFREE() to free results of
lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
ro->ro_rt == NULL.

Tested by: tuexen (SCTP part)


# a6cff10f 30-May-2012 Michael Tuexen <tuexen@FreeBSD.org>

Seperate SCTP checksum offloading for IPv4 and IPv6.
While there: remove some trainling whitespaces.

MFC after: 3 days
X-MFC with: 236170


# 356ab07e 28-May-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading. Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.

To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.

Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.

This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.

Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.

Individual driver updates will have to follow, as will SCTP.

Reported by: gallatin, dim, ..
Reviewed by: gallatin (glanced at?)
MFC after: 3 days
X-MFC with: r235961,235959,235958


# c69baa7e 26-May-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Correctly get the payload length in host byte order. While we
already plan to support >64k payload here, the IPv6 header payload
length obviously is only 16 bit and the calculations need to be right.

Reported by: dim
Tested by: dim
MFC after: 1 day
X-MFC: with r235958


# e7b92e27 24-May-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4 bz_ipv6_fast:

Add support for delayed checksum calculations in the IPv6
output path. We currently cannot offload to the card if we
add extension headers (which incl. fragmentation).

Fix two SCTP offload support copy&paste bugs: calculate
checksums if fragmenting and no need to flag IPv4 header
checksums in the IPv6 forwarding path.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days


# 5aa7e8ed 24-Feb-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

In selectroute() add a missing fibnum argument to an in6_rtalloc()
call in an #if 0 section.

In in6_selecthlim() optimize a case where in6p cannot be NULL due to an
earlier check.

More consistently use u_int instead of int for fibnum function arguments.

Sponsored by: Cisco Systems, Inc.
MFC after: 3 days


# 81d5d46b 03-Feb-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Add multi-FIB IPv6 support to the core network stack supplementing
the original IPv4 implementation from r178888:

- Use RT_DEFAULT_FIB in the IPv4 implementation where noticed.
- Use rt*fib() KPI with explicit RT_DEFAULT_FIB where applicable in
the NFS code.
- Use the new in6_rt* KPI in TCP, gif(4), and the IPv6 network stack
where applicable.
- Split in6_rtqtimo() and in6_mtutimo() as done in IPv4 and equally
prevent multiple initializations of callouts in in6_inithead().
- Use wrapper functions where needed to preserve the current KPI to
ease MFCs. Use BURN_BRIDGES to indicate expected future cleanup.
- Fix (related) comments (both technical or style).
- Convert to rtinit() where applicable and only use custom loops where
currently not possible otherwise.
- Multicast group, most neighbor discovery address actions and faith(4)
are locked to the default FIB. Individual IPv6 addresses will only
appear in the default FIB, however redirect information and prefixes
of connected subnets are automatically propagated to all FIBs by
default (mimicking IPv4 behavior as closely as possible).

Sponsored by: Cisco Systems, Inc.


# ee799639 03-Feb-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Add SO_SETFIB option support on PF_INET6 sockets and allow inheriting the
FIB number from the process, as set by setfib(2), on socket creation.

Sponsored by: Cisco Systems, Inc.


# fc06cd42 06-Nov-2011 Mikolaj Golub <trociny@FreeBSD.org>

Cache SO_REUSEPORT socket option in inpcb-layer in order to avoid
inp_socket->so_options dereference when we may not acquire the lock on
the inpcb.

This fixes the crash due to NULL pointer dereference in
in_pcbbind_setup() when inp_socket->so_options in a pcb returned by
in_pcblookup_local() was checked.

Reported by: dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com>
Suggested by: rwatson
Glanced by: rwatson
Tested by: dave jones <s.dave.jones@gmail.com>


# 6090ab8b 19-Sep-2011 Hiroki Sato <hrs@FreeBSD.org>

Copy ip6po_minmtu and ip6po_prefer_tempaddr in ip6_copypktopts(). This fixes
inconsistency when options are specified by both setsockopt() and ancillary
data types.

PR: kern/158307
Approved by: re (bz)


# 8a006adb 20-Aug-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Add support for IPv6 to ipfw fwd:
Distinguish IPv4 and IPv6 addresses and optional port numbers in
user space to set the option for the correct protocol family.
Add support in the kernel for carrying the new IPv6 destination
address and port.
Add support to TCP and UDP for IPv6 and fix UDP IPv4 to not change
the address in the IP header.
Add support for IPv6 forwarding to a non-local destination.
Add a regession test uitilizing VIMAGE to check all 20 possible
combinations I could think of.

Obtained from: David Dolson at Sandvine Incorporated
(original version for ipfw fwd IPv6 support)
Sponsored by: Sandvine Incorporated
PR: bin/117214
MFC after: 4 weeks
Approved by: re (kib)


# 6d79f3f6 27-Nov-2010 Rebecca Cran <brucec@FreeBSD.org>

Fix more continuous/contiguous typos (cf. r215955)


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 5f6bf451 24-Sep-2010 Attilio Rao <attilio@FreeBSD.org>

IP_BINDANY is not correctly handled in getsockopt() case.
Fix it by specifying the correct bits.

Sponsored by: Sandvine Incorporated
Reviewed by: bz, emaste, rstone
Obtained from: Sandvine Incorporated
MFC after: 10 days


# 1f93b772 11-May-2010 Kip Macy <kmacy@FreeBSD.org>

try working around panic by validating rt and lle

MFC after: 3 days


# 77931dd5 09-May-2010 Kip Macy <kmacy@FreeBSD.org>

Add flowtable support to IPv6

Tested by: qingli@

Reviewed by: qingli@
MFC after: 3 days


# 54bb4167 05-Apr-2010 Randall Stewart <rrs@FreeBSD.org>

MFC of 2 items to fix the csum for v6 issue:
Revision 205075 and 205104:

---------205075----------
With the recent change of the sctp checksum to support offload,
no delayed checksum was added to the ip6 output code. This
causes cards that do not support SCTP checksum offload to
have SCTP packets that are IPv6 NOT have the sctp checksum
performed. Thus you could not communicate with a peer. This
adds the missing bits to make the checksum happen for these cards.
-------------------------
---------205104----------
The proper fix for the delayed SCTP checksum is to
have the delayed function take an argument as to the offset
to the SCTP header. This allows it to work for V4 and V6.
This of course means changing all callers of the function
to either pass the header len, if they have it, or create
it (ip_hl << 2 or sizeof(ip6_hdr)).
-------------------------
PR: 144529


# 1966e5b5 12-Mar-2010 Randall Stewart <rrs@FreeBSD.org>

The proper fix for the delayed SCTP checksum is to
have the delayed function take an argument as to the offset
to the SCTP header. This allows it to work for V4 and V6.
This of course means changing all callers of the function
to either pass the header len, if they have it, or create
it (ip_hl << 2 or sizeof(ip6_hdr)).
PR: 144529
MFC after: 2 weeks


# 9b03990a 12-Mar-2010 Randall Stewart <rrs@FreeBSD.org>

With the recent change of the sctp checksum to support offload,
no delayed checksum was added to the ip6 output code. This
causes cards that do not support SCTP checksum offload to
have SCTP packets that are IPv6 NOT have the sctp checksum
performed. Thus you could not communicate with a peer. This
adds the missing bits to make the checksum happen for these cards.

PR: 144529
MFC after: 2 weeks


# 2ae7ec29 07-Feb-2010 Julian Elischer <julian@FreeBSD.org>

MFC of 197952 and 198075

Virtualize the pfil hooks so that different jails may chose different
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.
and
Unbreak the VIMAGE build with IPSEC, broken with r197952 by
virtualizing the pfil hooks.
For consistency add the V_ to virtualize the pfil hooks in here as well.


# 0b4b0b0f 10-Oct-2009 Julian Elischer <julian@FreeBSD.org>

Virtualize the pfil hooks so that different jails may chose different
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.

Sitting aroung in tree waiting to commit for: 2 months
MFC after: 2 months


# 3d2a8d36 05-Sep-2009 Qing Li <qingli@FreeBSD.org>

MFC r196864

This patch fixes the following issues:
- Interface link-local address is not reachable within the
node that owns the interface, this is due to the mismatch
in address scope as the result of the installed interface
address loopback route. Therefore for each interface
address loopback route, the rt_gateway field (of AF_LINK
type) will be used to track which interface a given
address belongs to. This will aid the address source to
use the proper interface for address scope/zone validation.
- The loopback address is not reachable. The root cause is
the same as the above.
- Empty nd6 entries are created for the IPv6 loopback addresses
only for validation reason. Doing so will eliminate as much
of the special case (loopback addresses) handling code
as possible, however, these empty nd6 entries should not
be returned to the userland applications such as the
"ndp" command.
Since both of the above issues contain common files, these
files are committed together.

Reviewed by: bz
Approved by: re


# 9452b0d2 05-Sep-2009 Qing Li <qingli@FreeBSD.org>

This patch fixes the following issues:
- Interface link-local address is not reachable within the
node that owns the interface, this is due to the mismatch
in address scope as the result of the installed interface
address loopback route. Therefore for each interface
address loopback route, the rt_gateway field (of AF_LINK
type) will be used to track which interface a given
address belongs to. This will aid the address source to
use the proper interface for address scope/zone validation.
- The loopback address is not reachable. The root cause is
the same as the above.
- Empty nd6 entries are created for the IPv6 loopback addresses
only for validation reason. Doing so will eliminate as much
of the special case (loopback addresses) handling code
as possible, however, these empty nd6 entries should not
be returned to the userland applications such as the
"ndp" command.
Since both of the above issues contain common files, these
files are committed together.

Reviewed by: bz
MFC after: immediately


# 530c0060 01-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# eddfbb76 14-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 8c0fec80 23-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references. The following routines now return references:

ifaddr_byindex
ifa_ifwithaddr
ifa_ifwithbroadaddr
ifa_ifwithdstaddr
ifa_ifwithnet
ifaof_ifpforaddr
ifa_ifwithroute
ifa_ifwithroute_fib
rt_getifa
rt_getifa_fib
IFP_TO_IA
ip_rtaddr
in6_ifawithifp
in6ifa_ifpforlinklocal
in6ifa_ifpwithaddr
in6_ifadd
carp_iamatch6
ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc). In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit. Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by: bz
Obtained from: Apple, Inc. (portions)
MFC after: 6 weeks (portions)


# 8d8bc018 08-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

After r193232 rt_tables in vnet.h are no longer indirectly dependent on
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.

Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.


# f44270e7 01-Jun-2009 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Rename IP_NONLOCALOK IP socket option to IP_BINDANY, to be more consistent
with OpenBSD (and BSD/OS originally). We can't easly do it SOL_SOCKET option
as there is no more space for more SOL_SOCKET options, but this option also
fits better as an IP socket option, it seems.
- Implement this functionality also for IPv6 and RAW IP sockets.
- Always compile it in (don't use additional kernel options).
- Remove sysctl to turn this functionality on and off.
- Introduce new privilege - PRIV_NETINET_BINDANY, which allows to use this
functionality (currently only unjail root can use it).

Discussed with: julian, adrian, jhb, rwatson, kmacy


# 71ce264c 09-May-2009 Warner Losh <imp@FreeBSD.org>

Implement RFC 5095 more fully. Rather than marking this no-op code as
BURN_BRIDGES, just remove it. Adjust comments.

Reviewed by: dwhite, emaste, battlez


# 33cde130 29-Apr-2009 Bruce M Simpson <bms@FreeBSD.org>

Bite the bullet, and make the IPv6 SSM and MLDv2 mega-commit:
import from p4 bms_netdev. Summary of changes:

* Connect netinet6/in6_mcast.c to build.
The legacy KAME KPIs are mostly preserved.
* Eliminate now dead code from ip6_output.c.
Don't do mbuf bingo, we are not going to do RFC 2292 style
CMSG tricks for multicast options as they are not required
by any current IPv6 normative reference.
* Refactor transports (UDP, raw_ip6) to do own mcast filtering.
SCTP, TCP unaffected by this change.
* Add ip6_msource, in6_msource structs to in6_var.h.
* Hookup mld_ifinfo state to in6_ifextra, allocate from
domifattach path.
* Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced.
Kernel consumers which need this should use in6m_lookup().
* Refactor IPv6 socket group memberships to use a vector (like IPv4).
* Update ifmcstat(8) for IPv6 SSM.
* Add witness lock order for IN6_MULTI_LOCK.
* Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths.
* Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup.
* Update carp(4) for new IPv6 SSM KPIs.
* Virtualize ip6_mrouter socket.
Changes mostly localized to IPv6 MROUTING.
* Don't do a local group lookup in MROUTING.
* Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge().
* Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode.
* Bump __FreeBSD_version to 800084.
* Update UPDATING.

NOTE WELL:
* This code hasn't been tested against real MLDv2 queriers
(yet), although the on-wire protocol has been verified in Wireshark.
* There are a few unresolved issues in the socket layer APIs to
do with scope ID propagation.
* There is a LOR present in ip6_output()'s use of
in6_setscope() which needs to be resolved. See comments in mld6.c.
This is believed to be benign and can't be avoided for the moment
without re-introducing an indirect netisr.

This work was mostly derived from the IGMPv3 implementation, and
has been sponsored by a third party.


# 1263305f 03-Mar-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Start removing IPv6 Type 0 Routing header code.
RH0 was deprecated by RFC 5095.

While most of the code had been disabled by #if 0 already, leave a
bit of infrastructure for possible RH2 code and a log message under
BURN_BRIDGES in case a user still tries to send RH0 packets.

Reviewed by: gnn (a bit back, earlier version)


# 33553d6e 27-Feb-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.


# 97aa4a51 08-Feb-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Try to remove/assimilate as much of formerly IPv4/6 specific
(duplicate) code in sys/netipsec/ipsec.c and fold it into
common, INET/6 independent functions.

The file local functions ipsec4_setspidx_inpcb() and
ipsec6_setspidx_inpcb() were 1:1 identical after the change
in r186528. Rename to ipsec_setspidx_inpcb() and remove the
duplicate.

Public functions ipsec[46]_get_policy() were 1:1 identical.
Remove one copy and merge in the factored out code from
ipsec_get_policy() into the other. The public function left
is now called ipsec_get_policy() and callers were adapted.

Public functions ipsec[46]_set_policy() were 1:1 identical.
Rename file local ipsec_set_policy() function to
ipsec_set_policy_internal().
Remove one copy of the public functions, rename the other
to ipsec_set_policy() and adapt callers.

Public functions ipsec[46]_hdrsiz() were logically identical
(ignoring one questionable assert in the v6 version).
Rename the file local ipsec_hdrsiz() to ipsec_hdrsiz_internal(),
the public function to ipsec_hdrsiz(), remove the duplicate
copy and adapt the callers.
The v6 version had been unused anyway. Cleanup comments.

Public functions ipsec[46]_in_reject() were logically identical
apart from statistics. Move the common code into a file local
ipsec46_in_reject() leaving vimage+statistics in small AF specific
wrapper functions. Note: unfortunately we already have a public
ipsec_in_reject().

Reviewed by: sam
Discussed with: rwatson (renaming to *_internal)
MFC after: 26 days
X-MFC: keep wrapper functions for public symbols?


# 97590249 17-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Another step assimilating IPv[46] PCB code:
normalize IN6P_* compat flags usage to their equialent
INP_* counterpart.

Discussed with: rwatson
Reviewed by: rwatson
MFC after: 4 weeks


# dcdb4371 16-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by: rwatson during review [1]
Discussed with: rwatson
Reviewed by: rwatson
MFC after: 4 weeks


# fc384fa5 15-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().

Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.

Discussed with: rwatson
Reviewed by: rwatson (version before review requested changes)
MFC after: 4 weeks (set the timer and see then)


# 6e6b3f7c 14-Dec-2008 Qing Li <qingli@FreeBSD.org>

This main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,

The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
me maintaining that branch before the svn conversion


# 4b79449e 02-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 6f4da201 15-Oct-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Check that the mbuf len is positive (like we do in the v4 case).

Read the other way round this means that even with the checks
the m_len turned negative in some cases which led to panics.
The reason to my understanding seems to be that the checks are wrong
(also for v4) ignoring possible padding when checking cmsg_len or
padding after data when adjusting the mbuf.
Doing proper cheks seems to break applications like named so
further investigation and regression tests are needed.

PR: kern/119123
Tested by: Ashish Shukla wahjava gmail.com
MFC after: 3 days


# 8b615593 02-Oct-2008 Marko Zec <zec@FreeBSD.org>

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 603724d3 17-Aug-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# cc29ac7d 29-Jul-2008 Robert Watson <rwatson@FreeBSD.org>

Marginally decomplicate set/getsockopt code in ip6_output.c by simply
using the passed arguments explicitly and unconditionally rather than
testing them and calling panic(). The result is the same but easier
to read.

MFC after: 3 days


# ea26d587 25-Mar-2008 Ruslan Ermilov <ru@FreeBSD.org>

Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT.
Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true
since the advent of MBUMA.

Reviewed by: arch

There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.


# 9e3bdede 14-Mar-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Correct IPsec behaviour with a 'use' level in SP but no SA available.
In that case return an continue processing the packet without IPsec.

PR: 121384
MFC after: 5 days
Reported by: Cyrus Rahman (crahman gmail.com)
Tested by: Cyrus Rahman (crahman gmail.com) [slightly older version]


# 8cfbd299 14-Mar-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Correct reference counting on the SP for outgoing IPv6 IPsec connections.

PR: 121374
Reported by: Cyrus Rahman (crahman gmail.com)
Tested by: Cyrus Rahman (crahman gmail.com)
MFC after: 5 days


# 41aa71dd 14-Mar-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Replace the function name in two identical printfs
by __func__, __LINE__ so we can distinguish them
when people report a problem.

PR: 121373
MFC after: 5 days


# c26fe973 02-Feb-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than passing around a cached 'priv', pass in an ucred to
ipsec*_set_policy and do the privilege check only if needed.

Try to assimilate both ip*_ctloutput code blocks calling ipsec*_set_policy.

Reviewed by: rwatson


# 79ba3952 24-Jan-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Replace the last susers calls in netinet6/ with privilege checks.

Introduce a new privilege allowing to set certain IP header options
(hop-by-hop, routing headers).

Leave a few comments to be addressed later.

Reviewed by: rwatson (older version, before addressing his comments)


# 9233d8f3 08-Jan-2008 David E. O'Brien <obrien@FreeBSD.org>

un-__P()


# b48287a3 10-Dec-2007 David E. O'Brien <obrien@FreeBSD.org>

Clean up VCS Ids.


# 016fb9d9 21-Nov-2007 Mike Makonnen <mtm@FreeBSD.org>

Instead of manually freeing the packet options structure (and not even doing
a good job of it) in the copypktopts() function, just call ip6_clearpktopts()
directly. Otherwise, the callers of this function would end up freeing the
memory twice.

Reviewed by: jinmei
PR: kern/116360


# 2a463222 05-Jul-2007 Xin LI <delphij@FreeBSD.org>

Space cleanup

Approved by: re (rwatson)


# 1272577e 05-Jul-2007 Xin LI <delphij@FreeBSD.org>

ANSIfy[1] plus some style cleanup nearby.

Discussed with: gnn, rwatson
Submitted by: Karl Sj?dahl - dunceor <dunceor gmail com> [1]
Approved by: re (rwatson)


# b2630c29 02-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC
option is now deprecated, as well as the KAME IPsec code.
What was FAST_IPSEC is now IPSEC.

Approved by: re
Sponsored by: Secure Computing


# 2cb64cb2 01-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by: bz
Approved by: re
Supported by: Secure Computing


# c2259ba4 13-Jun-2007 Robert Watson <rwatson@FreeBSD.org>

Include priv.h to pick up suser(9) definitions, missed in an earlier
commit.

Warnings spotted by: kris


# 43bc7a9c 04-Aug-2006 Brooks Davis <brooks@FreeBSD.org>

With exception of the if_name() macro, all definitions in net_osdep.h
were unused or already in if_var.h so add if_name() to if_var.h and
remove net_osdep.h along with all references to it.

Longer term we may want to kill off if_name() entierly since all modern
BSDs have if_xname variables rendering it unnecessicary.


# 656faadc 12-May-2006 Max Laier <mlaier@FreeBSD.org>

Remove ip6fw. Since ipfw has full functional IPv6 support now and - in
contrast to ip6fw - is properly lockes, it is time to retire ip6fw.


# 604afec4 01-Feb-2006 Christian S.J. Peron <csjp@FreeBSD.org>

Somewhat re-factor the read/write locking mechanism associated with the packet
filtering mechanisms to use the new rwlock(9) locking API:

- Drop the variables stored in the phil_head structure which were specific to
conditions and the home rolled read/write locking mechanism.
- Drop some includes which were used for condition variables
- Drop the inline functions, and convert them to macros. Also, move these
macros into pfil.h
- Move pfil list locking macros intp phil.h as well
- Rename ph_busy_count to ph_nhooks. This variable will represent the number
of IN/OUT hooks registered with the pfil head structure
- Define PFIL_HOOKED macro which evaluates to true if there are any
hooks to be ran by pfil_run_hooks
- In the IP/IP6 stacks, change the ph_busy_count comparison to use the new
PFIL_HOOKED macro.
- Drop optimization in pfil_run_hooks which checks to see if there are any
hooks to be ran, and returns if not. This check is already performed by the
IP stacks when they call:

if (!PFIL_HOOKED(ph))
goto skip_hooks;

- Drop in assertion which makes sure that the number of hooks never drops
below 0 for good measure. This in theory should never happen, and if it
does than there are problems somewhere
- Drop special logic around PFIL_WAITOK because rw_wlock(9) does not sleep
- Drop variables which support home rolled read/write locking mechanism from
the IPFW firewall chain structure.
- Swap out the read/write firewall chain lock internal to use the rwlock(9)
API instead of our home rolled version
- Convert the inlined functions to macros

Reviewed by: mlaier, andre, glebius
Thanks to: jhb for the new locking API


# fc4c8258 13-Jan-2006 Robert Watson <rwatson@FreeBSD.org>

When storing the results of malloc() in a pointer to a pointer, check
the pointer to a pointer for NULL, not the pointer for NULL.

Noticed by: Coverity Prevent analysis tool
MFC after: 3 days


# 743eee66 21-Oct-2005 SUZUKI Shinsuke <suz@FreeBSD.org>

sync with KAME regarding NDP

- introduced fine-grain-timer to manage ND-caches and IPv6 Multicast-Listeners
- supports Router-Preference <draft-ietf-ipv6-router-selection-07.txt>
- better prefix lifetime management
- more spec-comformant DAD advertisement
- updated RFC/internet-draft revisions

Obtained from: KAME
Reviewed by: ume, gnn
MFC after: 2 month


# 4ecbe331 21-Oct-2005 SUZUKI Shinsuke <suz@FreeBSD.org>

sync with KAME (renamed a macro IPV6_DADOUTPUT to IPV6_UNSPECSRC)

Obtained from: KAME


# 7ba26d99 07-Sep-2005 David E. O'Brien <obrien@FreeBSD.org>

IPv6 was improperly defining its malloc type the same as IPv4 (M_IPMADDR,
M_IPMOPTS, M_MRTABLE). Thus we had conflicting instantiations.
Create an IPv6-specific type to overcome this.


# e0aec682 30-Aug-2005 Andre Oppermann <andre@FreeBSD.org>

Use the correct mbuf type for MGET().


# e770771a 28-Jul-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

simplied the fix to FreeBSD-SA-04:06.ipv6. The previous one worried
too much even though we actually validate the parameters. This code
also is more compatible with other *BSDs, which do copyin within
setsockopt().

Submitted by: Keiichi SHIMA <keiichi__at__iijlab.net>
Reviewed by: security-officer (nectar)
Obtained from: KAME


# a1f7e5f8 24-Jul-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

scope cleanup. with this change
- most of the kernel code will not care about the actual encoding of
scope zone IDs and won't touch "s6_addr16[1]" directly.
- similarly, most of the kernel code will not care about link-local
scoped addresses as a special case.
- scope boundary check will be stricter. For example, the current
*BSD code allows a packet with src=::1 and dst=(some global IPv6
address) to be sent outside of the node, if the application do:
s = socket(AF_INET6);
bind(s, "::1");
sendto(s, some_global_IPv6_addr);
This is clearly wrong, since ::1 is only meaningful within a single
node, but the current implementation of the *BSD kernel cannot
reject this attempt.

Submitted by: JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp>
Obtained from: KAME


# 885adbfa 21-Jul-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

always copy ip6_pktopt. remove needcopy and needfree
argument/structure member accordingly.

Submitted by: Keiichi SHIMA <keiichi__at__iijlab.net>
Obtained from: KAME


# d5e3406d 21-Jul-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

be consistent on naming advanced API functions; use ip6_XXXpktopt(s).

Submitted by: Keiichi SHIMA <keiichi__at__iijlab.net>
Obtained from: KAME


# 8507acb1 21-Jul-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

NULL is not zero.

Submitted by: Keiichi SHIMA <keiichi__at__iijlab.net>
Obtained from: KAME


# 18b35df8 20-Jul-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

update comments:
- RFC2292bis -> RFC3542
- typo fixes

Submitted by: Keiichi SHIMA <keiichi__at__iijlab.net>
Obtained from: KAME


# fc74a9f9 10-Jun-2005 Brooks Davis <brooks@FreeBSD.org>

Stop embedding struct ifnet at the top of driver softcs. Instead the
struct ifnet or the layer 2 common structure it was embedded in have
been replaced with a struct ifnet pointer to be filled by a call to the
new function, if_alloc(). The layer 2 common structure is also allocated
via if_alloc() based on the interface type. It is hung off the new
struct ifnet member, if_l2com.

This change removes the size of these structures from the kernel ABI and
will allow us to better manage them as interfaces come and go.

Other changes of note:
- Struct arpcom is no longer referenced in normal interface code.
Instead the Ethernet address is accessed via the IFP2ENADDR() macro.
To enforce this ac_enaddr has been renamed to _ac_enaddr.
- The second argument to ether_ifattach is now always the mac address
from driver private storage rather than sometimes being ac_enaddr.

Reviewed by: sobomax, sam


# 403cbcf5 14-May-2005 George V. Neville-Neil <gnn@FreeBSD.org>

Fixes for various nits found by the Coverity tool.

In particular 2 missed return values and an inappropriate bcopy from
a possibly NULL pointer.

Reviewed by: jake
Approved by: rwatson
MFC after: 1 week


# 8195404b 18-Apr-2005 Brooks Davis <brooks@FreeBSD.org>

Add IPv6 support to IPFW and Dummynet.

Submitted by: Mariano Tortoriello and Raffaele De Lorenzo (via luigi)


# 283f9f8a 27-Feb-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

initialized the last arg to ip6_process_hopopts(), because the recent
code requires it to be 0 when a jumbo payload option is contained.

PR: kern/77934
Submitted by: Gerd Rausch <gerd@juniper.net>
Obtained from: KAME
MFC after: 2 days


# caf43b02 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes, separate for KAME


# 763f534e 02-Oct-2004 Doug White <dwhite@FreeBSD.org>

Disable MTU feedback in IPv6 if the sender writes data that must be fragmented.
Discussed extensively with KAME. The API author's intent isn't clear at this
point, so rather than remove the code entirely, #if 0 out and put a big
comment in for now. The IPV6_RECVPATHMTU sockopt is available if the
application wants to be notified of the path MTU to optimize packet sizes.

Thanks to JINMEI Tatuya <jinmei@isl.rdc.toshiba.co.jp> for putting up
with my incessant badgering on this issue, and fenner for pointing out
the API issue and suggesting solutions.


# d6a8d588 28-Sep-2004 Max Laier <mlaier@FreeBSD.org>

Add an additional struct inpcb * argument to pfil(9) in order to enable
passing along socket information. This is required to work around a LOR with
the socket code which results in an easy reproducible hard lockup with
debug.mpsafenet=1. This commit does *not* fix the LOR, but enables us to do
so later. The missing piece is to turn the filter locking into a leaf lock
and will follow in a seperate (later) commit.

This will hopefully be MT5'ed in order to fix the problem for RELENG_5 in
forseeable future.

Suggested by: rwatson
A lot of work by: csjp (he'd be even more helpful w/o mentor-reviews ;)
Reviewed by: rwatson, csjp
Tested by: -pf, -ipfw, LINT, csjp and myself
MFC after: 3 days

LOR IDs: 14 - 17 (not fixed yet)


# c21fd232 27-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

Always compile PFIL_HOOKS into the kernel and remove the associated kernel
compile option. All FreeBSD packet filters now use the PFIL_HOOKS API and
thus it becomes a standard part of the network stack.

If no hooks are connected the entire packet filter hooks section and related
activities are jumped over. This removes any performance impact if no hooks
are active.

Both OpenBSD and DragonFlyBSD have integrated PFIL_HOOKS permanently as well.


# 1f44b0a1 14-Aug-2004 David Malone <dwmalone@FreeBSD.org>

Get rid of the RANDOM_IP_ID option and make it a sysctl. NetBSD
have already done this, so I have styled the patch on their work:

1) introduce a ip_newid() static inline function that checks
the sysctl and then decides if it should return a sequential
or random IP ID.

2) named the sysctl net.inet.ip.random_id

3) IPv6 flow IDs and fragment IDs are now always random.
Flow IDs and frag IDs are significantly less common in the
IPv6 world (ie. rarely generated per-packet), so there should
be smaller performance concerns.

The sysctl defaults to 0 (sequential IP IDs).

Reviewed by: andre, silby, mlaier, ume
Based on: NetBSD
MFC after: 2 months


# 6f8aee22 13-May-2004 Bill Paul <wpaul@FreeBSD.org>

Fix a bug which I discovered recently while doing IPv6 testing at
Wind River. In the IPv4 output path, one of the tests in ip_output()
checks how many slots are actually available in the interface output
queue before attempting to send a packet. If, for example, we need
to transmit a packet of 32K bytes over an interface with an MTU of
1500, we know it's going to take about 21 fragments to do it. If
there's less than 21 slots left in the output queue, there's no point
in transmitting anything at all: IP does not do retransmission, so
sending only some of the fragments would just be a waste of bandwidth.
(In an extreme case, if you're sending a heavy stream of fragmented
packets, you might find yourself sending nothing by the first fragment
of all your packets.) So if ip_output() notices there's not enough
room in the output queue to send the frame, it just dumps the packet
and returns ENOBUFS to the app.

It turns out ip6_output() lacks this code. Consequently, this caused
the netperf UDPIPV6_STREAM test to produce very poor results with large
write sizes. This commit adds code to check the remaining space in the
output queue and junk fragmented packets if they're too big to be
sent, just like with IPv4. (I can't imagine anyone's running an NFS
server using UDP over IPv6, but if they are, this will likely make them
a lot happier. :)


# f36cfd49 07-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# a5d1aae3 26-Mar-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

Validate IPv6 socket options more carefully to avoid a panic.

PR: kern/61513
Reviewed by: cperciva, nectar


# da0f4099 17-Feb-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

IPSEC and FAST_IPSEC have the same internal API now;
so merge these (IPSEC has an extra ipsecstat)

Submitted by: "Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net>


# 8b00e59d 08-Feb-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

- obey ip6po_minmtu.
- notify a proper path MTU to applications.

Obtained from: KAME


# f073c60f 03-Feb-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

pass pcb rather than so. it is expected that per socket policy
works again.


# a46f7e7c 23-Dec-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

Catch a few places where NULL (pointer) was used where 0 (integer) was
expected (fix build).


# a89ec05e 22-Dec-2003 Peter Wemm <peter@FreeBSD.org>

Catch a few places where NULL (pointer) was used where 0 (integer) was
expected.


# aef03e95 21-Dec-2003 SUZUKI Shinsuke <suz@FreeBSD.org>

fixed a bug that IPv6 routing header does not work properly if specified from userland application

reviewed by: ume


# 03a1bc3e 16-Dec-2003 SUZUKI Shinsuke <suz@FreeBSD.org>

fixed an IPv6 path MTU discovery failure owing to a lack of initialization

Reviewed by: ume
Approved by: re (scottl)
MFC after: 1 day


# 289b28bd 23-Nov-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

pktopt may be null.

Approved by: re (rwatson)


# 97d8d152 20-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# e5f467a2 17-Nov-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

correct to look right interface.


# 7138d65c 08-Nov-2003 Sam Leffler <sam@FreeBSD.org>

replace explicit changes to rt_refcnt by RT_ADDREF and RT_REMREF
macros that expand to include assertions when the system is built
with INVARIANTS

Supported by: FreeBSD Foundation


# 07027f9d 06-Nov-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

correct behavior when ipv6mr_interface is 0. Matthias Drochner

Notified by: itojun
Obtained from: NetBSD


# 0f9ade71 04-Nov-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- cleanup SP refcnt issue.
- share policy-on-socket for listening socket.
- don't copy policy-on-socket at all. secpolicy no longer contain
spidx, which saves a lot of memory.
- deep-copy pcb policy if it is an ipsec policy. assign ID field to
all SPD entries. make it possible for racoon to grab SPD entry on
pcb.
- fixed the order of searching SA table for packets.
- fixed to get a security association header. a mode is always needed
to compare them.
- fixed that the incorrect time was set to
sadb_comb_{hard|soft}_usetime.
- disallow port spec for tunnel mode policy (as we don't reassemble).
- an user can define a policy-id.
- clear enc/auth key before freeing.
- fixed that the kernel crashed when key_spdacquire() was called
because key_spdacquire() had been implemented imcopletely.
- preparation for 64bit sequence number.
- maintain ordered list of SA, based on SA id.
- cleanup secasvar management; refcnt is key.c responsibility;
alloc/free is keydb.c responsibility.
- cleanup, avoid double-loop.
- use hash for spi-based lookup.
- mark persistent SP "persistent".
XXX in theory refcnt should do the right thing, however, we have
"spdflush" which would touch all SPs. another solution would be to
de-register persistent SPs from sptree.
- u_short -> u_int16_t
- reduce kernel stack usage by auto variable secasindex.
- clarify function name confusion. ipsec_*_policy ->
ipsec_*_pcbpolicy.
- avoid variable name confusion.
(struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct
secpolicy *)
- count number of ipsec encapsulations on ipsec4_output, so that we
can tell ip_output() how to handle the packet further.
- When the value of the ul_proto is ICMP or ICMPV6, the port field in
"src" of the spidx specifies ICMP type, and the port field in "dst"
of the spidx specifies ICMP code.
- avoid from applying IPsec transport mode to the packets when the
kernel forwards the packets.

Tested by: nork
Obtained from: KAME


# 29bc2c48 31-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

do not insert a dest option header (even specified by a user) that
should be placed before a routing header, unless a routing header
really exists.

Obtained from: KAME


# 02b9a206 26-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

re-add wrongly disappered IPV6_CHECKSUM stuff by introducing
ip6_raw_ctloutput().

Obtained from: KAME


# c302f5bc 24-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

remove the ip6r0_addr and ip6r0_slmap members from ip6_rthdr0{}
according to rfc2292bis.

Obtained from: KAME


# f95d4633 24-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

Switch Advanced Sockets API for IPv6 from RFC2292 to RFC3542
(aka RFC2292bis). Though I believe this commit doesn't break
backward compatibility againt existing binaries, it breaks
backward compatibility of API.
Now, the applications which use Advanced Sockets API such as
telnet, ping6, mld6query and traceroute6 use RFC3542 API.

Obtained from: KAME


# 9a4f9608 21-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- change scope to zone.
- change node-local to interface-local.
- better error handling of address-to-scope mapping.
- use in6_clearscope().

Obtained from: KAME


# 31b3783c 20-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

correct linkmtu handling.

Obtained from: KAME


# 31b1bfe1 17-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- add dom_if{attach,detach} framework.
- transition to use ifp->if_afdata.

Obtained from: KAME


# 953ad2fb 10-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

nuke SCOPEDROUTING. Though it was there for a long time,
it was never enabled.


# 7efe5d92 08-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- fix typo in comments.
- style.
- NULL is not 0.
- some variables were renamed.
- nuke unused logic.
(there is no functional change.)

Obtained from: KAME


# 68974f29 07-Oct-2003 Sam Leffler <sam@FreeBSD.org>

must lock route when the caller provided a route but not
an interface; otherwise the subsequent unlock blows up

Suffered by: Marcel Moolenaar <marcel@xcllnt.net>
Supported by: FreeBSD Foundation


# 40e39bbb 06-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

return(code) -> return (code)
(reduce diffs against KAME)


# d1dd20be 03-Oct-2003 Sam Leffler <sam@FreeBSD.org>

Locking for updates to routing table entries. Each rtentry gets a mutex
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.

Other/related changes:

o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts

Notes:

1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do-no know about mutex's. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.

Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)


# 29234943 01-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

Obey RANDOM_IP_ID.

Requested by: sam


# 8373d51d 01-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

randomize IPv6 fragment ID.

Obtained from: KAME


# b140bc1f 29-Sep-2003 Sam Leffler <sam@FreeBSD.org>

Correct pfil_run_hooks return handling: if the return value is non-zero
then the mbuf has been consumed by a hook; otherwise beware of a null
mbuf return (gack). In particular the bridge was doing the wrong thing.
While in the ipv6 code make it's handling of pfil_run_hooks identical
to netbsd.

Pointed out by: Pyun YongHyeon <yongari@kt-is.co.kr>


# 134ea224 23-Sep-2003 Sam Leffler <sam@FreeBSD.org>

o update PFIL_HOOKS support to current API used by netbsd
o revamp IPv4+IPv6+bridge usage to match API changes
o remove pfil_head instances from protosw entries (no longer used)
o add locking
o bump FreeBSD version for 3rd party modules

Heavy lifting by: "Max Laier" <max@love2party.net>
Supported by: FreeBSD Foundation
Obtained from: NetBSD (bits of pfil.h and pfil.c)


# 8608c4c1 20-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Remove unused variables in the IPSEC case.

Submitted by: Lars Eggert <larse@ISI.EDU>


# 340c35de 19-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Add a TCP TIMEWAIT state which uses less space than a fullblown TCP
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.

Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 0e7dea83 06-Jan-2003 Sam Leffler <sam@FreeBSD.org>

purge extraneous clears of M_PKTHDR since M_MOVE_PKTHDR does this already


# 9967cafc 30-Dec-2002 Sam Leffler <sam@FreeBSD.org>

Correct mbuf packet header propagation. Previously, packet headers
were sometimes propagated using M_COPY_PKTHDR which actually did
something between a "move" and a "copy" operation. This is replaced
by M_MOVE_PKTHDR (which copies the pkthdr contents and "removes" it
from the source mbuf) and m_dup_pkthdr which copies the packet
header contents including any m_tag chain. This corrects numerous
problems whereby mbuf tags could be lost during packet manipulations.

These changes also introduce arguments to m_tag_copy and m_tag_copy_chain
to specify if the tag copy work should potentially block. This
introduces an incompatibility with openbsd which we may want to revisit.

Note that move/dup of packet headers does not handle target mbufs
that have a cluster bound to them. We may want to support this;
for now we watch for it with an assert.

Finally, M_COPYFLAGS was updated to include M_FIRSTFRAG|M_LASTFRAG.

Supported by: Vernier Networks
Reviewed by: Robert Watson <rwatson@FreeBSD.org>


# 35f6695b 31-Oct-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

plugged memory leakage in some erroneous cases

Obtained from: KAME
MFC after: 1 week


# b9234faf 15-Oct-2002 Sam Leffler <sam@FreeBSD.org>

Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks


# 5d846453 15-Oct-2002 Sam Leffler <sam@FreeBSD.org>

Replace aux mbufs with packet tags:

o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month


# 4a6a94d8 22-Aug-2002 Archie Cobbs <archie@FreeBSD.org>

Replace (ab)uses of "NULL" where "0" is really meant.


# 12253795 24-Jul-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

make sure to set/unset INP_IPV4 according to a value
of IN6P_IPV6_V6ONLY

Reviewed by: Keiichi SHIMA <keiichi@iij.ad.jp>


# 854d3b19 22-Jul-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

do not refer to IN6P_BINDV6ONLY anymore.

Obtained from: KAME
MFC after: 1 week


# 88ff5695 18-Apr-2002 SUZUKI Shinsuke <suz@FreeBSD.org>

just merged cosmetic changes from KAME to ease sync between KAME and FreeBSD.
(based on freebsd4-snap-20020128)

Reviewed by: ume
MFC after: 1 week


# 44731cab 01-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 1183d014 29-Mar-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

Fix cached route problem.

Submitted by: Keiichi SHIMA <keiichi@iij.ad.jp> (KAME)
Reviewed by: JINMEI Tatuya <jinmei@isl.rdc.toshiba.co.jp> (KAME)
MFC after: 1 week


# 72b1d826 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove duplicate extern declarations to silence warnings.


# d49d0ca7 09-Dec-2001 Andrew R. Reiter <arr@FreeBSD.org>

- Replace M_WAIT with M_TRYWAIT since the M_WAIT flag is deprecated.

Spotted by: bde


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# f9132ceb 05-Sep-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Wrap array accesses in macros, which also happen to be lvalues:

ifnet_addrs[i - 1] -> ifaddr_byindex(i)
ifindex2ifnet[i] -> ifnet_byindex(i)

This is intended to ease the conversion to SMPng.


# 89349143 08-Jul-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

soopt_mcopyout() frees mbuf if error occurs, and DOES NOT free it if it is
successful.
This part was lacked during merge.

Obtained from: KAME
MFC after: 1 week


# 3efe99eb 07-Jul-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

The m_free call in the ip6_fw_ctl_ptr == NULL case apparently
tries to free uninitialized mbuf.
This was my mistake during recent KAME merge. This part is for
*BSD other than FreeBSD.

Submitted by: Alexander N. Kabaev <ak03@gte.com>


# 0554093b 24-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

disallow setsockopt(IPV6_V6ONLY) for already bound sockets.

Obtained from: KAME
MFC after: 10 days


# 3e617560 24-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

decrease warning

Obtained from: KAME
MFC after: 10 days


# 99fe1b37 24-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Nuke the comment about MIP6. We don't have MIP6 code, yet.

MFC after: 10 days


# 33841545 10-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# 12ae55c6 23-May-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Fix memory leak.

Submitted by: itojun


# 6c0bea35 20-Jan-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

When ip6_fw_ctl() or soopt_mcopyout() return without success,
don't free mbuf. It is already freed by these routins.

PR: kern/24248


# 2a0c503e 21-Dec-2000 Bosko Milekic <bmilekic@FreeBSD.org>

* Rename M_WAIT mbuf subsystem flag to M_TRYWAIT.
This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that it's necessary to check whether
calls with M_(TRY)WAIT also resulted in a failed allocation.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.

* Fix a typo in a comment in mbuf.h

* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
value of the M_WAIT flag, this could have became a big problem.


# cf9fa8e7 29-Oct-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move suser() and suser_xxx() prototypes and a related #define from
<sys/proc.h> to <sys/systm.h>.

Correctly document the #includes needed in the manpage.

Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.


# fe937674 28-Oct-2000 Josef Karthauser <joe@FreeBSD.org>

Count per-address statistics for IP fragments.

Requested by: ru
Obtained from: BSD/OS


# 5da9f8fa 19-Oct-2000 Josef Karthauser <joe@FreeBSD.org>

Augment the 'ifaddr' structure with a 'struct if_data' to keep
statistics on a per network address basis.

Teach the IPv4 and IPv6 input/output routines to log packets/bytes
against the network address connected to the flow.

Teach netstat to display the per-address stats for IP protocols
when 'netstat -i' is evoked, instead of displaying the per-interface
stats.


# 20cb9f9e 23-Sep-2000 Hajimu UMEMOTO <ume@FreeBSD.org>

Make ip6fw as loadable module.


# 1469c434 12-Aug-2000 Hajimu UMEMOTO <ume@FreeBSD.org>

Make compilable with -DIPFILTER.
Because I don't use ipfilter at all, this is not tested.
I don't know if ipfilter is work for IPv6.

Submitted by: yoshiaki@kt.rim.or.jp


# c4ac87ea 31-Jul-2000 Darren Reed <darrenr@FreeBSD.org>

activate pfil_hooks and covert ipfilter to use it


# 686cdd19 04-Jul-2000 Jun-ichiro itojun Hagino <itojun@FreeBSD.org>

sync with kame tree as of july00. tons of bug fixes/improvements.

API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
(also syntax change)


# 06a429a3 24-May-2000 Archie Cobbs <archie@FreeBSD.org>

Just need to pass the address family to if_simloop(), not the whole sockaddr.


# f63e7634 09-Mar-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Initialize mbuf pointer at getting ipsec policy.
Without this, kernel will panic at getsockopt() of IPSEC_POLICY.
Also make compilable libipsec/test-policy.c which tries getsockopt() of
IPSEC_POLICY.

Approved by: jkh

Submitted by: sakane@kame.net


# 7d0d8dc3 03-Mar-2000 Yoshinobu Inoue <shin@FreeBSD.org>

CMSG_XXX macros alignment fixes to follow RFC2292.

Approved by: jkh

Submitted by: Partly from tech@openbsd
Reviewed by: itojun


# 210d0432 29-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Add ip6fw.
Yes it is almost code freeze, but as the result of many thought, now I
think this should be added before 4.0...

make world check, kernel build check is done.

Reviewed by: green
Obtained from: KAME project


# 67e394e0 28-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Backout diffs which should not be included.


# 577a30ee 27-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

#This is a null commit to give correct description for the previous change.
#Please forget the strange log message of the previous commit .

IPv6 multicast routing.
kernel IPv6 multicast routing support.
pim6 dense mode daemon
pim6 sparse mode daemon
netstat support of IPv6 multicast routing statistics

Merging to the current and testing with other existing multicast routers
is done by Tatsuya Jinmei <jinmei@kame.net>, who writes and maintainances
the base code in KAME distribution.

Make world check and kernel build check was also successful.

Obtained from: KAME project


# 91ec0a1e 27-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Sorry I didn't commit these files at the commit just a few minutes before.
(IPv6 multicast routing)
I think I mistakenly touched TAB and the last arg sys/netinet6 to
the cvs commit changed to sys/netinet6/in6_proto.c.


# fa310a7e 12-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

added missing IPV6_PORTRANGE case.


# 6a800098 22-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

IPSEC support in the kernel.
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 369dc8ce 21-Dec-1999 Eivind Eklund <eivind@FreeBSD.org>

Change incorrect NULLs to 0s


# ae5bcbff 09-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

rtcalloc() is removed because it turned out not to be necessary for FreeBSD.
(It was added as a part of KAME patch)

Specified by: jdp@polstra.com


# cfa1ca9d 07-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

udp IPv6 support, IPv6/IPv4 tunneling support in kernel,
packet divert at kernel for IPv6/IPv4 translater daemon

This includes queue related patch submitted by jburkhol@home.com.

Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# e1da8747 22-Nov-1999 Yoshinobu Inoue <shin@FreeBSD.org>

Removed IPSEC and IPV6FIREWALL because they are not ready yet.


# 82cd038d 21-Nov-1999 Yoshinobu Inoue <shin@FreeBSD.org>

KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project