History log of /freebsd-current/sys/net/if_stf.c
Revision Date Author Comments
# 71625ec9 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c comment pattern

Remove /^/[*/]\s*\$FreeBSD\$.*\n/


# c373e1d6 22-Apr-2023 Zhenlei Huang <zlei@FreeBSD.org>

if_stf: Delete unreachable code

As the flag M_WAITOK is passed to ip_encap_attach(), then the function
will never return NULL, and the following code within NULL check branch
will be unreachable.

No functional change intended.

Reviewed by: kp
Fixes: 6d8fdfa9d5e7d Rework IP encapsulation handling code
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39746


# 2c2b37ad 13-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

ifnet/API: Move struct ifnet definition to a <net/if_private.h>

Hide the ifnet structure definition, no user serviceable parts inside,
it's a netstack implementation detail. Include it temporarily in
<net/if_var.h> until all drivers are updated to use the accessors
exclusively.

Reviewed by: glebius
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38046


# 91ebcbe0 21-Sep-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

if_clone: migrate some consumers to the new KPI.

Convert most of the cloner customers who require custom params
to the new if_clone KPI.

Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D36636
MFC after: 2 weeks


# 439da7f0 30-Nov-2021 Kristof Provost <kp@FreeBSD.org>

if_stf: KASAN fix

In in_stf_input() we grabbed a pointer to the IPv4 header and later did
an m_pullup() before we look at the IPv6 header. However, m_pullup()
could rearrange the mbuf chain and potentially invalidate the pointer to
the IPv4 header.

Avoid this issue by copying the IP header rather than getting a pointer
to it.

Reported by: markj, Jenkins (KASAN job)
Reviewed by: markj
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D33192


# b46512f7 09-Nov-2021 Kristof Provost <kp@FreeBSD.org>

if_stf: add dtrace probe points

Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D33038


# 19dc6445 08-Nov-2021 Kristof Provost <kp@FreeBSD.org>

if_stf: add 6rd support

Implement IPv6 Rapid Deployment (RFC5969) on top of the existing 6to4
(RFC3056) if_stf code.

PR: 253328
Reviewed by: hrs
Obtained from: pfSense
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D33037


# 8e45fed3 04-Nov-2021 Kristof Provost <kp@FreeBSD.org>

if_stf: enable use in vnet jails

The cloner must be per-vnet so that cloned interfaces get destroyed when
the vnet goes away. Otherwise we fail assertions in vnet_if_uninit():

panic: vnet_if_uninit:475 tailq &V_ifnet=0xfffffe01665fe070 not empty
cpuid = 19
time = 1636107064
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe015d0cac60
vpanic() at vpanic+0x187/frame 0xfffffe015d0cacc0
panic() at panic+0x43/frame 0xfffffe015d0cad20
vnet_if_uninit() at vnet_if_uninit+0x7b/frame 0xfffffe015d0cad30
vnet_destroy() at vnet_destroy+0x170/frame 0xfffffe015d0cad60
prison_deref() at prison_deref+0x9b0/frame 0xfffffe015d0cadd0
sys_jail_remove() at sys_jail_remove+0x119/frame 0xfffffe015d0cae00
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe015d0caf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe015d0caf30
--- syscall (508, FreeBSD ELF64, sys_jail_remove), rip = 0x8011e920a, rsp = 0x7fffffffe788, rbp = 0x7fffffffe810 ---
KDB: enter: panic

MFC after: 3 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D32849


# 3576121c 05-Nov-2021 Kristof Provost <kp@FreeBSD.org>

if_stf: style(9) pass

As stated in style(9): "Values in return statements should be enclosed
in parentheses."

MFC after: 3 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D32848


# c8ee75f2 10-Oct-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Use network epoch to protect local IPv4 addresses hash.

The modification to the hash are already naturally locked by
in_control_sx. Convert the hash lists to CK lists. Remove the
in_ifaddr_rmlock. Assert the network epoch where necessary.

Most cases when the hash lookup is done the epoch is already entered.
Cover a few cases, that need entering the epoch, which mostly is
initial configuration of tunnel interfaces and multicast addresses.

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D32584


# 2144431c 08-Oct-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Remove in_ifaddr_lock acquisiton to access in_ifaddrhead.

An IPv4 address is embedded into an ifaddr which is freed
via epoch. And the in_ifaddrhead is already a CK list. Use
the network epoch to protect against use after free.

Next step would be to CK-ify the in_addr hash and get rid of the...

Reviewed by: melifaro
Differential Revision: https://reviews.freebsd.org/D32434


# bb4a7d94 04-Mar-2021 Kristof Provost <kp@FreeBSD.org>

net: Introduce IPV6_DSCP(), IPV6_ECN() and IPV6_TRAFFIC_CLASS() macros

Introduce convenience macros to retrieve the DSCP, ECN or traffic class
bits from an IPv6 header.

Use them where appropriate.

Reviewed by: ae (previous version), rscheff, tuexen, rgrimes
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D29056


# 6ad7446c 02-Jul-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Complete conversions from fib<4|6>_lookup_nh_<basic|ext> to fib<4|6>_lookup().

fib[46]_lookup_nh_ represents pre-epoch generation of fib api, providing less guarantees
over pointer validness and requiring on-stack data copying.

With no callers remaining, remove fib[46]_lookup_nh_ functions.

Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25445


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 4b24e5b1 10-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Interface output method must be executed in network epoch, so if_addr_rlock()
isn't needed here.


# b8a6e03f 07-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Widen NET_EPOCH coverage.

When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111


# ca1163bd 30-Mar-2019 Mark Johnston <markj@FreeBSD.org>

Do not perform DAD on stf(4) interfaces.

stf(4) interfaces are not multicast-capable so they can't perform DAD.
They also did not set IFF_DRV_RUNNING when an address was assigned, so
the logic in nd6_timer() would periodically flag such an address as
tentative, resulting in interface flapping.

Fix the problem by setting IFF_DRV_RUNNING when an address is assigned,
and do some related cleanup:
- In in6if_do_dad(), remove a redundant check for !UP || !RUNNING.
There is only one caller in the tree, and it only looks at whether
the return value is non-zero.
- Have in6if_do_dad() return false if the interface is not
multicast-capable.
- Set ND6_IFF_NO_DAD when an address is assigned to an stf(4) interface
and the interface goes UP as a result. Note that this is not
sufficient to fix the problem because the new address is marked as
tentative and DAD is started before in6_ifattach() is called.
However, setting no_dad is formally correct.
- Change nd6_timer() to not flag addresses as tentative if no_dad is
set.

This is based on a patch from Viktor Dukhovni.

Reported by: Viktor Dukhovni <ietf-dane@dukhovni.org>
Reviewed by: ae
MFC after: 3 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19751


# 232485a1 26-Oct-2018 Eugene Grosbein <eugen@FreeBSD.org>

Prevent stf(4) from panicing due to unprotected access to INADDR_HASH.

PR: 220078
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12457
Tested-by: Cassiano Peixoto and others


# 6d8fdfa9 05-Jun-2018 Andrey V. Elsukov <ae@FreeBSD.org>

Rework IP encapsulation handling code.

Currently it has several disadvantages:
- it uses single mutex to protect internal structures. It is used by
data- and control- path, thus there are no parallelism at all.
- it uses single list to keep encap handlers for both INET and INET6
families.
- struct encaptab keeps unneeded information (src, dst, masks, protosw),
that isn't used by code in the source tree.
- matches are prioritized and when many tunneling interfaces are
registered, encapcheck handler of each interface is invoked for each
packet. The search takes O(n) for n interfaces. All this work is done
with exclusive lock held.

What this patch includes:
- the datapath is converted to be lockless using epoch(9) KPI.
- struct encaptab now linked using CK_LIST.
- all unused fields removed from struct encaptab. Several new fields
addedr: min_length is the minimum packet length, that encapsulation
handler expects to see; exact_match is maximum number of bits, that
can return an encapsulation handler, when it wants to consume a packet.
- IPv6 and IPv4 handlers are stored in separate lists;
- added new "encap_lookup_t" method, that will be used later. It is
targeted to speedup lookup of needed interface, when gif(4)/gre(4) have
many interfaces.
- the need to use protosw structure is eliminated. The only pr_input
method was used from this structure, so I don't see the need to keep
using it.
- encap_input_t method changed to avoid using mbuf tags to store softc
pointer. Now it is passed directly trough encap_input_t method.
encap_getarg() funtions is removed.
- all sockaddr structures and code that uses them removed. We don't have
any code in the tree that uses them. All consumers use encap_attach_func()
method, that relies on invoking of encapcheck() to determine the needed
handler.
- introduced struct encap_config, it contains parameters of encap handler
that is going to be registered by encap_attach() function.
- encap handlers are stored in lists ordered by exact_match value, thus
handlers that need more bits to match will be checked first, and if
encapcheck method returns exact_match value, the search will be stopped.
- all current consumers changed to use new KPI.

Reviewed by: mmacy
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D15617


# 46d0f824 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

net: fix set but not used


# d7c5a620 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

ifnet: Replace if_addr_lock rwlock with epoch + mutex

Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree
4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32
4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32
4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32
4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32
4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32
4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32
4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32

After the patch

InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree
5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51
5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51
5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51
5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51
5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52
5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15366


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# f227f64a 27-Jul-2017 Luiz Otavio O Souza <loos@FreeBSD.org>

Remove the unused mutex since r273220.

MFC after: 1 week
Sponsored by: Rubicon Communications, LLC (Netgate)


# 350e6227 08-Mar-2017 Andrey V. Elsukov <ae@FreeBSD.org>

Remove now unneded cast.


# d177868c 29-Jan-2017 Luiz Otavio O Souza <loos@FreeBSD.org>

The stf(4) interface name does not conform with the default naming
convention for interfaces, because only one stf(4) interface can exist
in the system.

This disallow the use of unit numbers different than 0, however, it is
possible to create the clone without specify the unit number (wildcard).

In the wildcard case we must update the interface name before return.

This fix an infinite recursion in pf code that keeps track of network
interfaces and groups:

1 - a group for the cloned type of the interface is added (stf in this
case);
2 - the system will now try to add an interface named stf (instead of
stf0) to stf group;
3 - when pfi_kif_attach() tries to search for an already existing 'stf'
interface, the 'stf' group is returned and thus the group is added
as an interface of itself;

This will now cause a crash at the first attempt to traverse the groups
which the stf interface belongs (which loops over itself).

Obtained from: pfSense
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (Netgate)


# 0792bcbb 16-Dec-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert if_stf(4) to new routing api.


# 926381e1 31-Jul-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Ansify if_stf.c


# a5965d15 30-Jul-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Build if_stf(4) module only when both INET and INET6 support are enabled.


# cc0a3c8c 29-Jul-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Convert in_ifaddr_lock and in6_ifaddr_lock to rmlock.

Both are used to protect access to IP addresses lists and they can be
acquired for reading several times per packet. To reduce lock contention
it is better to use rmlock here.

Reviewed by: gnn (previous version)
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D3149


# 06cd035a 23-Dec-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Remove if_stf.h. It contains only one function declaration used by if_stf(4).
Also make in_stf_protosw structure static.


# 2dfcd0ae 01-Dec-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Remove unneded check. No need to do m_pullup to the size that we prepended.

MFC after: 1 week
Sponsored by: Yandex LLC


# f9723c77 20-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify API: use new NHOP_LOOKUP_AIFP flag to select what ifp
we need to return.
Rename fib[64]_lookup_nh_basic to fib[64]_lookup_nh, add flags
fields for all relevant functions.


# 36f34ac7 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix nd6_output_flush() prototype.
Remove 'net/route_internal.h' header from stf.


# 033074c4 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Replace 'struct route *' if_output() argument with 'struct nhop_info *'.
Leave 'struct route' as is for legacy routing api users.
Remove most of rtalloc_ign*-derived functions.


# 1a75e3b2 06-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Make checks for rt_mtu generic:

Some virtual if drivers has (ab)used ifa ifa_rtrequest hook to enforce
route MTU to be not bigger that interface MTU. While ifa_rtrequest hooking
might be an option in some situation, it is not feasible to do MTU checks
there: generic (or per-domain) routing code is perfectly capable of doing
this.

We currrently have 3 places where MTU is altered:

1) route addition.
In this case domain overrides radix _addroute callback (in[6]_addroute)
and all necessary checks/fixes are/can be done there.

2) route change (especially, GW change).
In this case, there are no explicit per-domain calls, but one can
override rte by setting ifa_rtrequest hook to domain handler
(inet6 does this).

3) ifconfig ifaceX mtu YYYY
In this case, we have no callbacks, but ip[6]_output performes runtime
checks and decreases rt_mtu if necessary.

Generally, the goals are to be able to handle all MTU changes in
control plane, not in runtime part, and properly deal with increased
interface MTU.

This commit changes the following:
* removes hooks setting MTU from drivers side
* adds proper per-doman MTU checks for case 1)
* adds generic MTU check for case 2)

* The latter is done by using new dom_ifmtu callback since
if_mtu denotes L3 interface MTU, e.g. maximum trasmitted _packet_ size.
However, IPv6 mtu might be different from if_mtu one (e.g. default 1280)
for some cases, so we need an abstract way to know maximum MTU size
for given interface and domain.
* moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies
user-supplied data which must be checked.
* removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to
use this functions on new non-inserted rte.

More changes will follow soon.

MFC after: 1 month
Sponsored by: Yandex LLC


# 69b74805 04-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert gif and stf to use new routing api.


# 8c3cfe0b 04-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Hide 'struct rtentry' and all its macro inside new header:
net/route_internal.h
The goal is to make its opaque for all code except route/rtsock and
proto domain _rmx.


# b4e8f808 19-Oct-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Switch IPv4 output path to use new routing api.

The goals of the new API is to provide consumers with minimal
needed information, but as fast as possible. So we provide
full nexthop info copied into alighed on-cache structure
instead of rte/ia pointers, their refcounts and locks.
This does not provide solution for protecting from egress
ifp destruction, but does not make it any worse.

Current changes:

nhops:
Add fib4_lookup_prepend() function which stores either full
L2+L3 prepend info (e.g. MAC header in case of plain IPv4) or
L3 info with NH_FLAGS_L2_INCOMPLETE flag indicating that no valid L2
info exists and we have to take "slow" path.

ip_output:
Currently ip[ 46]_output consumers use 'struct route' for
the following purposes:
1) double lookup avoidance(route caching)
2) plain route caching
3) get path MTU to be able to notify source.
The former pattern is mostly used by various tunnels
(gif, gre, stf). (Actually, gre is the only remaining,
others were already converted. Their locking model did
not scale good enogh to benefit from such caching, so
we have (temporarily) removed it without any performance
loss).
Plain route caching used by SCTP is simply wrong and should be removed.
Temporary break it for now just to be able to compile.
Optimize path mtu reporting by providing it in new 'route_info' stucture.

Minimize games with @ia locking/refcounting for route lookup:
add special nhop[46]_extended structure to store more route attributes.
Pointer to given structure can be passed to fib4_lookup_prepend() to indicate
we want this info (we actually needs it for UDP and raw IP).

ether_output:
Provide light-weight ether_output2() call to deal with
transmitting L2 frame (e.g. properly handle broadcast/simloop/bridge/
other L2 hooks before actually transmitting frame by if_transmit()).
Add a hack based on new RT_NHOP ro_flag to distinguish which version should
we call. Better way is probably to add a new "if_output_frame" driver
callbacks.

Next steps:
* Convert ip_fastfwd part
* Implement auto-growing array for per-radix nexthops
* Implement LLE tracking for nexthop calculations to be able to
immediately provide all necessary info in single route lookup
for gateway routes
* Switch radix locking scheme to runtime/cfg lock
* Implement multipath support for rtsock
* Implement "tracked nexthops" for tunnels (e.g. _proper_
nexthop caching)
* Add IPv6 support for remaining parts (postponed not to
interfere with user/ae/inet6 branch)
* Consider adding "if_output_frame" driver call to
ease logical frame pushing.


# d74b9a2c 17-Oct-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Remove route caching in if_stf.
* Copy necessary in6_ifa on stack instead of playing with refcounts.


# 3751dddb 19-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically convert to if_inc_counter().


# 73d76e77 14-Aug-2014 Kevin Lo <kevlo@FreeBSD.org>

Change pr_output's prototype to avoid the need for explicit casts.
This is a follow up to r269699.

Phabric: D564
Reviewed by: jhb


# 8f5a8818 07-Aug-2014 Kevin Lo <kevlo@FreeBSD.org>

Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have
only one protocol switch structure that is shared between ipv4 and ipv6.

Phabric: D476
Reviewed by: jhb


# af3b2549 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 37a107a4 27-Jun-2014 Glen Barber <gjb@FreeBSD.org>

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 3da1cf1e 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# e3a7aa6f 04-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with: melifaro
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 76039bc8 26-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 47e8d432 25-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Add const qualifier to the dst parameter of the ifnet if_output method.


# e37e7917 27-Dec-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Add an ability to set net.link.stf.permit_rfc1918 from the loader.

MFC after: 2 weeks


# 51743c5f 27-Dec-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Add net.link.stf.permit_rfc1918 sysctl variable. It can be used to allow
the use of private IPv4 addresses with stf(4).

MFC after: 2 weeks


# eb1b1807 05-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# 8f134647 22-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>


# 42a58907 16-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Make the "struct if_clone" opaque to users of the cloning API. Users
now use function calls:

if_clone_simple()
if_clone_advanced()

to initialize a cloner, instead of macros that initialize if_clone
structure.

Discussed with: brooks, bz, 1 year ago


# 9823d527 10-Oct-2012 Kevin Lo <kevlo@FreeBSD.org>

Revert previous commit...

Pointyhat to: kevlo (myself)


# a10cee30 09-Oct-2012 Kevin Lo <kevlo@FreeBSD.org>

Prefer NULL over 0 for pointers


# 2541fcd9 17-Aug-2012 John Baldwin <jhb@FreeBSD.org>

Unexpand a couple of TAILQ_FOREACH()s.


# 99ab4b12 15-Jul-2012 Alexander V. Chernikov <melifaro@FreeBSD.org>

Permit changing MTU in 6to4 relay.

This behavior is recommended by RFC 4213 clause 3.2.

Sometimes fragmentation is the least evil.
For example, some Linux IPVS kernels forwards
ICMPv6 checksums to real servers incorrectly.

Reviewed by: hrs(previous version)
Approved by: kib(mentor)
MFC after: 1 week


# 6472ac3d 07-Nov-2011 Ed Schouten <ed@FreeBSD.org>

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# a34c6aeb 03-Jul-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Tag mbufs of all incoming frames or packets with the interface's FIB
setting (either default or if supported as set by SIOCSIFFIB, e.g.
from ifconfig).

Submitted by: Alexander V. Chernikov (melifaro ipfw.ru)
Reviewed by: julian
MFC after: 2 weeks


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# e50d35e6 03-May-2010 Maxim Sobolev <sobomax@FreeBSD.org>

Add new tunable 'net.link.ifqmaxlen' to set default send interface
queue length. The default value for this parameter is 50, which is
quite low for many of today's uses and the only way to modify this
parameter right now is to edit if_var.h file. Also add read-only
sysctl with the same name, so that it's possible to retrieve the
current value.

MFC after: 1 month


# 530c0060 01-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# eddfbb76 14-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 3893212d 25-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Update if_stf and if_tun to use if_addr_rlock()/if_addr_runlock() rather
than IF_ADDR_LOCK()/IF_ADDR_UNLOCK() when iterating ifp->if_addrhead.

MFC after: 6 weeks


# 2d9cfaba 25-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Add a new global rwlock, in_ifaddr_lock, which will synchronize use of the
in_ifaddrhead and INADDR_HASH address lists.

Previously, these lists were used unsynchronized as they were effectively
never changed in steady state, but we've seen increasing reports of
writer-writer races on very busy VPN servers as core count has gone up
(and similar configurations where address lists change frequently and
concurrently).

For the time being, use rwlocks rather than rmlocks in order to take
advantage of their better lock debugging support. As a result, we don't
enable ip_input()'s read-locking of INADDR_HASH until an rmlock conversion
is complete and a performance analysis has been done. This means that one
class of reader-writer races still exists.

MFC after: 6 weeks
Reviewed by: bz


# fe0ecfd6 24-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Make stf_getsrcifa6() return a reference to an in6_ifaddr rather than
a pointer, and dispose of the references when no longer needed.

MFC after: 6 weeks


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 82324826 20-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Prefer ifa_link (structure field) to ifa_list (macro alias for it).

MFC after: 2 weeks


# 989c0cb5 20-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Prefer if_addrhead (FreeBSD) to if_addrlist (BSD compat) naming for the
interface address list in if_stf.c.

Acquire interface address list locks around address list access.

MFC after: 2 months


# 279aa3d4 16-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

Change if_output to take a struct route as its fourth argument in order
to allow passing a cached struct llentry * down to L2

Reviewed by: rwatson


# 4b79449e 02-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 8b615593 02-Oct-2008 Marko Zec <zec@FreeBSD.org>

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 0dae32f2 24-Sep-2008 David Malone <dwmalone@FreeBSD.org>

Some people's 6to4 routers seem to have been blowing up because of
the unlocked route caching in if_stf. Add a mutex that protects
access to cached route. This seemed to fix problems for Pekka Savola.

Nick Sayer had similar problems, and in his case completly disabling
the route cache seemed to help. Add a sysctl net.link.stf.route_cache
that can be used to turn off route caching in if_stf.

PR: 122283
MFC after: 2 weeks
Tested by: Pekka Savola, Nick Sayer.


# 603724d3 17-Aug-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# 8b07e49a 09-May-2008 Julian Elischer <julian@FreeBSD.org>

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# bc60490a 23-Sep-2007 Christian S.J. Peron <csjp@FreeBSD.org>

Certain consumers of rtalloc like gif(4) and if_stf(4) lookup the
route and once they are done with it, call rtfree(). rtfree() should
only be used when we are certain we hold the last reference to the
route. This bug results in console messages like the following:

rtfree: 0xc40f7000 has 1 refs

This patch switches the rtfree() to use RTFREE_LOCKED() instead,
which should handle the reference counting on the route better.

Approved by: re@ (gnn)
Reviewed by: bms
Reported by: many via net@ and current@
Tested by: many


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 43bc7a9c 04-Aug-2006 Brooks Davis <brooks@FreeBSD.org>

With exception of the if_name() macro, all definitions in net_osdep.h
were unused or already in if_var.h so add if_name() to if_var.h and
remove net_osdep.h along with all references to it.

Longer term we may want to kill off if_name() entierly since all modern
BSDs have if_xname variables rendering it unnecessicary.


# 6b7330e2 09-Jul-2006 Sam Leffler <sam@FreeBSD.org>

Revise network interface cloning to take an optional opaque
parameter that can specify configuration parameters:
o rev cloner api's to add optional parameter block
o add SIOCCREATE2 that accepts parameter data
o rev vlan support to use new api (maintain old code)

Reviewed by: arch@


# 4b97d7af 29-Jun-2006 Yaroslav Tykhiy <ytykhiy@gmail.com>

There is a consensus that ifaddr.ifa_addr should never be NULL,
except in places dealing with ifaddr creation or destruction; and
in such special places incomplete ifaddrs should never be linked
to system-wide data structures. Therefore we can eliminate all the
superfluous checks for "ifa->ifa_addr != NULL" and get ready
to the system crashing honestly instead of masking possible bugs.

Suggested by: glebius, jhb, ru


# 06dc090f 29-Jun-2006 Yaroslav Tykhiy <ytykhiy@gmail.com>

Use TAILQ_FOREACH.


# 16d878cc 02-Jun-2006 Christian S.J. Peron <csjp@FreeBSD.org>

Fix the following bpf(4) race condition which can result in a panic:

(1) bpf peer attaches to interface netif0
(2) Packet is received by netif0
(3) ifp->if_bpf pointer is checked and handed off to bpf
(4) bpf peer detaches from netif0 resulting in ifp->if_bpf being
initialized to NULL.
(5) ifp->if_bpf is dereferenced by bpf machinery
(6) Kaboom

This race condition likely explains the various different kernel panics
reported around sending SIGINT to tcpdump or dhclient processes. But really
this race can result in kernel panics anywhere you have frequent bpf attach
and detach operations with high packet per second load.

Summary of changes:

- Remove the bpf interface's "driverp" member
- When we attach bpf interfaces, we now set the ifp->if_bpf member to the
bpf interface structure. Once this is done, ifp->if_bpf should never be
NULL. [1]
- Introduce bpf_peers_present function, an inline operation which will do
a lockless read bpf peer list associated with the interface. It should
be noted that the bpf code will pickup the bpf_interface lock before adding
or removing bpf peers. This should serialize the access to the bpf descriptor
list, removing the race.
- Expose the bpf_if structure in bpf.h so that the bpf_peers_present function
can use it. This also removes the struct bpf_if; hack that was there.
- Adjust all consumers of the raw if_bpf structure to use bpf_peers_present

Now what happens is:

(1) Packet is received by netif0
(2) Check to see if bpf descriptor list is empty
(3) Pickup the bpf interface lock
(4) Hand packet off to process

From the attach/detach side:

(1) Pickup the bpf interface lock
(2) Add/remove from bpf descriptor list

Now that we are storing the bpf interface structure with the ifnet, there is
is no need to walk the bpf interface list to locate the correct bpf interface.
We now simply look up the interface, and initialize the pointer. This has a
nice side effect of changing a bpf interface attach operation from O(N) (where
N is the number of bpf interfaces), to O(1).

[1] From now on, we can no longer check ifp->if_bpf to tell us whether or
not we have any bpf peers that might be interested in receiving packets.

In collaboration with: sam@
MFC after: 1 month


# 303989a2 09-Nov-2005 Ruslan Ermilov <ru@FreeBSD.org>

Use sparse initializers for "struct domain" and "struct protosw",
so they are easier to follow for the human being.


# 4e7e0183 08-Nov-2005 Andrew Thompson <thompsa@FreeBSD.org>

Move the cloned interface list management in to if_clone. For some drivers the
softc lists and associated mutex are now unused so these have been removed.

Calling if_clone_detach() will now destroy all the cloned interfaces for the
driver and in most cases is all thats needed to unload.

Idea by: brooks
Reviewed by: brooks


# febd0759 12-Oct-2005 Andrew Thompson <thompsa@FreeBSD.org>

Change the reference counting to count the number of cloned interfaces for each
cloner. This ensures that ifc->ifc_units is not prematurely freed in
if_clone_detach() before the clones are destroyed, resulting in memory modified
after free. This could be triggered with if_vlan.

Assert that all cloners have been destroyed when freeing the memory.

Change all simple cloners to destroy their clones with ifc_simple_destroy() on
module unload so the reference count is properly updated. This also cleans up
the interface destroy routines and allows future optimisation.

Discussed with: brooks, pjd, -current
Reviewed by: brooks


# 01399f34 26-Jun-2005 David Malone <dwmalone@FreeBSD.org>

Fix some long standing bugs in writing to the BPF device attached to
a DLT_NULL interface. In particular:

1) Consistently use type u_int32_t for the header of a
DLT_NULL device - it continues to represent the address
family as always.
2) In the DLT_NULL case get bpf_movein to store the u_int32_t
in a sockaddr rather than in the mbuf, to be consistent
with all the DLT types.
3) Consequently fix a bug in bpf_movein/bpfwrite which
only permitted packets up to 4 bytes less than the MTU
to be written.
4) Fix all DLT_NULL devices to have the code required to
allow writing to their bpf devices.
5) Move the code to allow writing to if_lo from if_simloop
to looutput, because it only applies to DLT_NULL devices
but was being applied to other devices that use if_simloop
possibly incorrectly.

PR: 82157
Submitted by: Matthew Luckie <mjl@luckie.org.nz>
Approved by: re (scottl)


# b03965dd 13-Jun-2005 Brooks Davis <brooks@FreeBSD.org>

Initialze ifp->if_softc.

Submitted by: ume


# fc74a9f9 10-Jun-2005 Brooks Davis <brooks@FreeBSD.org>

Stop embedding struct ifnet at the top of driver softcs. Instead the
struct ifnet or the layer 2 common structure it was embedded in have
been replaced with a struct ifnet pointer to be filled by a call to the
new function, if_alloc(). The layer 2 common structure is also allocated
via if_alloc() based on the interface type. It is hung off the new
struct ifnet member, if_l2com.

This change removes the size of these structures from the kernel ABI and
will allow us to better manage them as interfaces come and go.

Other changes of note:
- Struct arpcom is no longer referenced in normal interface code.
Instead the Ethernet address is accessed via the IFP2ENADDR() macro.
To enforce this ac_enaddr has been renamed to _ac_enaddr.
- The second argument to ether_ifattach is now always the mac address
from driver private storage rather than sometimes being ac_enaddr.

Reviewed by: sobomax, sam


# 89bc9a31 23-Feb-2005 Sam Leffler <sam@FreeBSD.org>

the rt parameter to ifa_rtrequest callbacks should always be non-null;
eliminate grauitous ptr checks that follow ptr deref's

Noticed by: Coverity Prevent analysis tool


# 529ed56f 11-Jan-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

don't see NBPFILTER.


# 2d106a00 11-Jan-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

remove HAVE_OLD_BPF part.


# 9b1a7076 11-Jan-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

fix typo.


# c398230b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 3e019dea 15-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Do a pass over all modules in the kernel and make them return EOPNOTSUPP
for unknown events.

A number of modules return EINVAL in this instance, and I have left
those alone for now and instead taught MOD_QUIESCE to accept this
as "didn't do anything".


# f889d2ef 22-Jun-2004 Brooks Davis <brooks@FreeBSD.org>

Major overhaul of pseudo-interface cloning. Highlights include:

- Split the code out into if_clone.[ch].
- Locked struct if_clone. [1]
- Add a per-cloner match function rather then simply matching names of
the form <name><unit> and <name>.
- Use the match function to allow creation of <interface>.<tag>
vlan interfaces. The old way is preserved unchanged!
- Also the match function to allow creation of stf(4) interfaces named
stf0, stf, or 6to4. This is the only major user visible change in
that "ifconfig stf" creates the interface stf rather then stf0 and
does not print "stf0" to stdout.
- Allow destroy functions to fail so they can refuse to delete
interfaces. Currently, we forbid the deletion of interfaces which
were created in the init function, particularly lo0, pflog0, and
pfsync0. In the case of lo0 this was a panic implementation so it
does not count as a user visiable change. :-)
- Since most interfaces do not need the new functionality, an family of
wrapper functions, ifc_simple_*(), were created to wrap old style
cloner functions.
- The IF_CLONE_INITIALIZER macro is replaced with a new incompatible
IFC_CLONE_INITIALIZER and ifc_simple consumers use IFC_SIMPLE_DECLARE
instead.

Submitted by: Maurycy Pawlowski-Wieronski <maurycy at fouk.org> [1]
Reviewed by: andre, mlaier
Discussed on: net


# 5dba30f1 30-May-2004 Poul-Henning Kamp <phk@FreeBSD.org>

add missing #include <sys/module.h>


# 1861b710 18-Apr-2004 Brooks Davis <brooks@FreeBSD.org>

Use an tempory struct ifnet *ifp instead of sc->sc_if to access the
ifnet in stf_clone_create. Also use if_printf() instead of printf().


# bb2bfb4f 13-Apr-2004 Brooks Davis <brooks@FreeBSD.org>

Staticize <if>_clone_{create,destroy} functions.

Reviewed by: mlaier


# 15db03a0 09-Mar-2004 Robert Watson <rwatson@FreeBSD.org>

Introduce stf_mtx to protect global softc list in if_stf. Add
stf_destroy() to handle the common softc destruction path for the
two destruction sources: interface cloning destroy, and module
unload.

NOTE: sc_ro, the cached route for stf conversion, is not synchronized
against concurrent access in this change, that will follow in a future
change.

Reviewed by: pjd


# 591cf7ce 06-Mar-2004 Robert Watson <rwatson@FreeBSD.org>

Const-poison ip_stf_ttl to make it clear that the variable is not
modified at run-time.


# 437ffe18 27-Dec-2003 Sam Leffler <sam@FreeBSD.org>

o eliminate widespread on-stack mbuf use for bpf by introducing
a new bpf_mtap2 routine that does the right thing for an mbuf
and a variable-length chunk of data that should be prepended.
o while we're sweeping the drivers, use u_int32_t uniformly when
when prepending the address family (several places were assuming
sizeof(int) was 4)
o return M_ASSERTVALID to BPF_MTAP* now that all stack-allocated
mbufs have been eliminated; this may better be moved to the bpf
routines

Reviewed by: arch@ and several others


# 9bf40ede 31-Oct-2003 Brooks Davis <brooks@FreeBSD.org>

Replace the if_name and if_unit members of struct ifnet with new members
if_xname, if_dname, and if_dunit. if_xname is the name of the interface
and if_dname/unit are the driver name and instance.

This change paves the way for interface renaming and enhanced pseudo
device creation and configuration symantics.

Approved By: re (in principle)
Reviewed By: njl, imp
Tested On: i386, amd64, sparc64
Obtained From: NetBSD (if_xname)


# d1dd20be 03-Oct-2003 Sam Leffler <sam@FreeBSD.org>

Locking for updates to routing table entries. Each rtentry gets a mutex
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.

Other/related changes:

o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts

Notes:

1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do-no know about mutex's. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.

Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)


# 1cafed39 04-Mar-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Update netisr handling; Each SWI now registers its queue, and all queue
drain routines are done by swi_net, which allows for better queue control
at some future point. Packets may also be directly dispatched to a netisr
instead of queued, this may be of interest at some installations, but
currently defaults to off.

Reviewed by: hsu, silby, jayanth, sam
Sponsored by: DARPA, NAI Labs


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 8d95b0ce 15-Jan-2003 SUZUKI Shinsuke <suz@FreeBSD.org>

sync with KAME to simplify rev 1.28's patch (no functional changes)

Obtained from: KAME
Reviewd by: fenner
Approved by: re (jhb)


# 2b0e5976 05-Jan-2003 Bill Fenner <fenner@FreeBSD.org>

Fix alignment problems -- the embedded v4 address is guaranteed to
be only 16-bit aligned, so only do byte operations to compare with it.


# 6fc32a24 14-Nov-2002 Sam Leffler <sam@FreeBSD.org>

network interface and link layer changes:

o on input don't strip the Ethernet header from packets
o input packet handling is now done with if_input
o track changes to ether_ifattach/ether_ifdetach API
o track changes to bpf tapping
o call ether_ioctl for default handling of ioctl's
o use constants from net/ethernet.h where possible

Reviewed by: many
Approved by: re


# 6b459e49 20-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

When packets pass in and out of six-to-four (STF) tunnels, perform
labeling checks and operations as with other network interfaces.
Eventually, if it proves desirable, we might want to offer special
casing of this or other tunnel interfaces where we have an existing
label of interest, rather than treating it as though it's an
entirely fresh mbuf in the incoming/outgoing encapsulation directions.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 5d846453 15-Oct-2002 Sam Leffler <sam@FreeBSD.org>

Replace aux mbufs with packet tags:

o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month


# f26b2d5b 17-Sep-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

- increment interface output counter. sync w/ netbsd-current
- increase if_oerrors. sync w/netbsd

Obtained from: KAME


# ce9d7b2f 17-Sep-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

- reject SIOCSIFADDR if embedded address is in private address range
- reject packets from private address range. from hitachi

Obtained from: KAME


# ae5a19be 25-May-2002 Brooks Davis <brooks@FreeBSD.org>

Move all unit number management cloned interfaces into the cloning
code. The reverts the API change which made the <if>_clone_destory()
functions return an int instead of void bringing us into closer
alignment with NetBSD.

Reviewed by: net (a long time ago)


# 88ff5695 18-Apr-2002 SUZUKI Shinsuke <suz@FreeBSD.org>

just merged cosmetic changes from KAME to ease sync between KAME and FreeBSD.
(based on freebsd4-snap-20020128)

Reviewed by: ume
MFC after: 1 week


# 929ddbbb 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# 3b16e7b2 11-Mar-2002 Maxime Henrion <mux@FreeBSD.org>

Simplify the interface cloning framework by handling unit
unit allocation with a bitmap in the generic layer. This
allows us to get rid of the duplicated rman code in every
clonable interface.

Reviewed by: brooks
Approved by: phk


# b75496fe 04-Mar-2002 Brooks Davis <brooks@FreeBSD.org>

Change the network interface cloning API so the destroy function returns
an int errorcode instead of void in preperation for merging cloning of
the loopback device.

Submitted by: mux
MFC after: 2 weeks


# 777b9faa 27-Feb-2002 Peter Wemm <peter@FreeBSD.org>

Fix warnings.


# e8783c4d 08-Jan-2002 Mike Smith <msmith@FreeBSD.org>

Staticise private interface lists.


# 1ed4b9fe 06-Dec-2001 Andrew R. Reiter <arr@FreeBSD.org>

- malloc should be passed M_WAITOK, not M_WAIT (a mbuf flag)
- make use of M_ZERO to remove a call to bzero()


# 8071913d 17-Oct-2001 Ruslan Ermilov <ru@FreeBSD.org>

Pull post-4.4BSD change to sys/net/route.c from BSD/OS 4.2.

Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *''
as the argument. Pass rt_addrinfo all the way down to rtrequest1
and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now
``rt_addrinfo *'' instead of ``sockaddr *'' (almost noone is
using it anyways).

Benefit: the following command now works. Previously we needed
two route(8) invocations, "add" then "change".
# route add -inet6 default ::1 -ifp gif0

Remove unsafe typecast in rtrequest(), from ``rtentry *'' to
``sockaddr *''. It was introduced by 4.3BSD-Reno and never
corrected.

Obtained from: BSD/OS, NetBSD
MFC after: 1 month
PR: kern/28360


# cf912c89 28-Sep-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Use in_ifaddrhashtbl instead of in_ifaddrhead to look up IP address.


# abb64706 18-Sep-2001 Brooks Davis <brooks@FreeBSD.org>

Make stf a clonable device.

Yes this really is rather silly and the implementation is overkill given
that you are only allowed one of them, but NetBSD implements cloning on
this device and it's a less cluttered example of cloning then most.


# ff265614 07-Sep-2001 Julian Elischer <julian@FreeBSD.org>

Patches from KAME to remove usage of Varargs in existing
IPV4 code. For now they will still have some in the developing stuff (IPv6)

Submitted by: Keiichi SHIMA / <keiichi@iij.ad.jp>
Obtained from: KAME


# f0ffb944 03-Sep-2001 Julian Elischer <julian@FreeBSD.org>

Patches from Keiichi SHIMA <keiichi@iij.ad.jp>
to make ip use the standard protosw structure again.

Obtained from: Well, KAME I guess.


# 53dab5fe 02-Jul-2001 Brooks Davis <brooks@FreeBSD.org>

gif(4) and stf(4) modernization:

- Remove gif dependencies from stf.
- Make gif and stf into modules
- Make gif cloneable.

PR: kern/27983
Reviewed by: ru, ume
Obtained from: NetBSD
MFC after: 1 week


# 8acb2290 24-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

inject outbound packet to BPF.

Submitted by: itojun
Obtained from: KAME
MFC after: 10 days


# 33841545 10-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# 22f29826 03-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Use <sys/queue.h> macro api rather than fondle its implementation detals.

Created with: /usr/bin/sed
Reviewed by: /sbin/md5


# 2a0c503e 21-Dec-2000 Bosko Milekic <bmilekic@FreeBSD.org>

* Rename M_WAIT mbuf subsystem flag to M_TRYWAIT.
This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that it's necessary to check whether
calls with M_(TRY)WAIT also resulted in a failed allocation.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.

* Fix a typo in a comment in mbuf.h

* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
value of the M_WAIT flag, this could have became a big problem.


# df5e1987 25-Nov-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Lock down the network interface queues. The queue mutex must be obtained
before adding/removing packets from the queue. Also, the if_obytes and
if_omcasts fields should only be manipulated under protection of the mutex.

IF_ENQUEUE, IF_PREPEND, and IF_DEQUEUE perform all necessary locking on
the queue. An IF_LOCK macro is provided, as well as the old (mutex-less)
versions of the macros in the form _IF_ENQUEUE, _IF_QFULL, for code which
needs them, but their use is discouraged.

Two new macros are introduced: IF_DRAIN() to drain a queue, and IF_HANDOFF,
which takes care of locking/enqueue, and also statistics updating/start
if necessary.


# 46aa3347 27-Oct-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Convert all users of fldoff() to offsetof(). fldoff() is bad
because it only takes a struct tag which makes it impossible to
use unions, typedefs etc.

Define __offsetof() in <machine/ansi.h>

Define offsetof() in terms of __offsetof() in <stddef.h> and <sys/types.h>

Remove myriad of local offsetof() definitions.

Remove includes of <stddef.h> in kernel code.

NB: Kernelcode should *never* include from /usr/include !

Make <sys/queue.h> include <machine/ansi.h> to avoid polluting the API.

Deprecate <struct.h> with a warning. The warning turns into an error on
01-12-2000 and the file gets removed entirely on 01-01-2001.

Paritials reviews by: various.
Significant brucifications by: bde


# d1d1144b 15-Aug-2000 Jun-ichiro itojun Hagino <itojun@FreeBSD.org>

repair endianness issue in IN_MULTICAST().
again, *BSD difference...

From: Nick Sayer <nsayer@quack.kfu.com>


# 686cdd19 04-Jul-2000 Jun-ichiro itojun Hagino <itojun@FreeBSD.org>

sync with kame tree as of july00. tons of bug fixes/improvements.

API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
(also syntax change)