History log of /freebsd-current/sys/netinet6/nd6.c
Revision Date Author Comments
# 4f96be33 24-Mar-2024 Gleb Smirnoff <glebius@FreeBSD.org>

icmp6: move ICMPv6 related tunables to the files where they are used

Most of them can be declared as static after the move out of in6_proto.c.
Keeping sysctl(9) declarations with their text descriptions next to the
variable declaration creates self-documenting code. There should be no
functional changes.

Differential Revision: https://reviews.freebsd.org/D44481


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 3d0d5b21 23-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Explicitly include <net/if_private.h> in netstack

Summary:
In preparation for making if_t completely opaque outside of the netstack,
explicitly include the header. <net/if_var.h> will stop including the
header in the future.

Sponsored by: Juniper Networks, Inc.
Reviewed by: glebius, melifaro
Differential Revision: https://reviews.freebsd.org/D38200


# 6468b6b2 15-Jan-2023 Alexander V. Chernikov <melifaro@FreeBSD.org>

nd6: fix panic in lltable_drop_entry_queue()

nd6_resolve_slow() can be called without mbuf. If the LLE entry
is not reachable, nd6_resolve_slow() will add this NULL mbuf to
the holdchain via lltable_append_entry_queue, which will "append"
NULL to the end of the queue (effectively no-op) and bump la_numhold
value. When this entry gets freed, the kernel will panic due to the
inconsistency between the amount of mbufs in the queue and the value
of la_numhold.

Fix the panic by checking that the mbuf is not NULL prior to inserting it
into the holdchain.
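
The invariant behind this fix can be sketched in userland C. This is a
minimal model, not the kernel code: struct ex_pkt, struct ex_holdq and both
functions are hypothetical stand-ins for the mbuf hold chain and its
counter. It only shows why a NULL packet must not bump the counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued packet (not struct mbuf). */
struct ex_pkt {
	struct ex_pkt *next;
};

/* Hypothetical hold queue: numheld must equal the number of linked packets. */
struct ex_holdq {
	struct ex_pkt *head;
	struct ex_pkt *tail;
	int numheld;
};

/* Append a packet; a NULL packet is ignored so the counter stays in sync. */
static int
ex_holdq_append(struct ex_holdq *q, struct ex_pkt *p)
{
	if (p == NULL)
		return (0);
	p->next = NULL;
	if (q->tail == NULL)
		q->head = p;
	else
		q->tail->next = p;
	q->tail = p;
	q->numheld++;
	return (1);
}

/* Unlink every queued packet and check the counter against the chain. */
static int
ex_holdq_drop(struct ex_holdq *q)
{
	struct ex_pkt *p, *next;
	int dropped = 0;

	for (p = q->head; p != NULL; p = next) {
		next = p->next;
		p->next = NULL;
		dropped++;
	}
	/* This is the consistency check that would fail without the NULL test. */
	assert(dropped == q->numheld);
	q->head = q->tail = NULL;
	q->numheld = 0;
	return (dropped);
}
```

Without the NULL test in ex_holdq_append(), a call with no packet would
increment numheld while leaving the chain unchanged, and the assertion in
ex_holdq_drop() would fire.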

Reported by: kib
MFC after: 3 days


# 744bfb21 28-Oct-2022 John Baldwin <jhb@FreeBSD.org>

Import the WireGuard driver from zx2c4.com.

This commit brings back the driver from FreeBSD commit
f187d6dfbf633665ba6740fe22742aec60ce02a2 plus subsequent fixes from
upstream.

Relative to upstream this commit includes a few other small fixes such
as additional INET and INET6 #ifdef's, #include cleanups, and updates
for recent API changes in main.

Reviewed by: pauamma, gbe, kevans, emaste
Obtained from: git@git.zx2c4.com:wireguard-freebsd @ 3cc22b2
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36909


# 177f04d5 29-Aug-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: constantify @rc in rib_decompose_notification().

Clarify the @rc immutability by explicitly marking @rc const.

MFC after: 2 weeks


# 6d4f6e4c 09-Aug-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: make rib_add_redirect() use new nhop-based KPI

MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D36169


# 8036234c 23-Aug-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

netinet6: fix SIOCSPFXFLUSH_IN6 by skipping manually-configured prefixes

Summary:
Currently netinet6/ code allocates IPv6 prefixes (nd_prefix) for
both manually-assigned addresses and advertised prefixes. As a result,
prefixes from manually-assigned addresses can be seen in `ndp -p` list
and be cleared via `ndp -P`. The latter relies on the SIOCSPFXFLUSH_IN6
ioctl to clear the prefix list.
The original intent of the SIOCSPFXFLUSH_IN6 was to clear prefixes
originated from the advertising routers:

```
1998-09-02 JINMEI, Tatuya <jinmei@isl.rdc.toshiba.co.jp>
* nd6.c (nd6_ioctl): added 2 new ioctls; SIOCSRTRFLUSH_IN6 and
SIOCSPFXFLUSH_IN6. The former is to flush all default routers
in the default router list, and the latter is to flush all the
prefixes and the addresses derived from them in the prefix list.
```

Restore the intent by marking prefixes derived from the RA messages
with newly-added ndpr_flags.ra_derived flag and skip prefixes not marked
with such flag during deletion and listing.

Differential Revision: https://reviews.freebsd.org/D36312
MFC after: 2 weeks


# f998535a 10-Aug-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

netinet6: allow ND entries creation for all directly-reachable
destinations.

The current assumption is that kernel-handled rtadv prefixes along with
the interface address prefixes are the only prefixes considered in
the ND neighbor eligibility code.
Change this by allowing any non-gateway routes to be eligible. This
will allow DHCPv6-controlled routes to be correctly handled by
the ND code.
Refactor nd6_is_new_addr_neighbor() to enable more deterministic
performance in "found" case and remove non-needed
V_rt_add_addr_allfibs handling logic.

Reviewed By: kbowling
Differential Revision: https://reviews.freebsd.org/D23695
MFC after: 1 month


# cd330397 07-Aug-2022 Gordon Bergling <gbe@FreeBSD.org>

inet6(4): Fix a typo in a source code comment

- s/Unreachablity/Unreachability/

MFC after: 3 days


# 50207b2d 26-Jul-2022 Dimitry Andric <dim@FreeBSD.org>

Adjust function definition in nd6.c to avoid clang 15 warnings

With clang 15, the following -Werror warning is produced:

sys/netinet6/nd6.c:247:12: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
nd6_destroy()
^
void

This is because nd6_destroy() is declared with a (void) argument list, but
defined with an empty argument list. Make the definition match the
declaration.
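
As an illustration of the mismatch, here is a minimal sketch using a
hypothetical function ex_destroy() and a hypothetical state flag, not the
actual nd6.c code. The declaration and the definition both spell out
(void), which is the form that avoids the warning.

```c
#include <assert.h>

/* Hypothetical module state, only here so the function does something. */
static int ex_initialized = 1;

/* Declaration with an explicit (void) parameter list: a real prototype. */
void ex_destroy(void);

/*
 * Definition spelled the same way.  Writing "ex_destroy()" here instead
 * would be the prototype-less form that -Wstrict-prototypes reports.
 */
void
ex_destroy(void)
{
	assert(ex_initialized);	/* destroying twice would be a caller bug */
	ex_initialized = 0;
}
```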

MFC after: 3 days


# d18b4bec 31-May-2022 Arseny Smalyuk <smalukav@gmail.com>

netinet6: Fix mbuf leak in NDP

Mbufs leak when manually removing incomplete NDP records with pending packets via ndp -d.
It happens because lltable_drop_entry_queue() relies on the `la_numheld`
counter when dropping NDP entries (lles). It turned out NDP code never
increased `la_numheld`, so the actual free never happened.

Fix the issue by introducing unified lltable_append_entry_queue(),
common for both ARP and NDP code, properly addressing packet queue
maintenance.

Reviewed By: melifaro
Differential Revision: https://reviews.freebsd.org/D35365
MFC after: 2 weeks


# 990a6d18 08-Apr-2022 Mark Johnston <markj@FreeBSD.org>

net: Fix memory leaks in lltable_calc_llheader() error paths

Also convert raw epoch_call() calls to lltable_free_entry() calls, no
functional change intended. There's no need to asynchronously free the
LLEs in that case to begin with, but we might as well use the lltable
interfaces consistently.

Noticed by code inspection; I believe lltable_calc_llheader() failures
do not generally happen in practice.

Reviewed by: bz
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34832


# 10e0082f 28-Aug-2021 Gordon Bergling <gbe@FreeBSD.org>

inet6(4): Fix a few common typos in source code comments

- s/reshedule/reschedule/

MFC after: 3 days


# c541bd36 21-Aug-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

lltable: Add support for "child" LLEs holding encap for IPv4oIPv6 entries.

Currently we use pre-calculated headers inside LLE entries as prepend data
for `if_output` functions. Using these headers allows saving some
CPU cycles/memory accesses on the fast path.

However, this approach makes adding L2 header for IPv4 traffic with IPv6
nexthops more complex, as it is not possible to store multiple
pre-calculated headers inside lle. Additionally, the solution space is
limited by the fact that PCB caching saves LLEs in addition to the nexthop.

Thus, add support for creating special "child" LLEs for the purpose of holding
custom family encaps and store mbufs pending resolution. To simplify handling
of those LLEs, store them in a linked-list inside a "parent" (e.g. normal) LLE.
Such LLEs are not visible when iterating LLE table. Their lifecycle is bound
to the "parent" LLE - it is not possible to delete "child" when parent is alive.
Furthermore, "child" LLEs are static (RTF_STATIC), avoiding the complex state
machine used by the standard LLEs.

nd6_lookup() and nd6_resolve() now accept an additional argument, family,
allowing to return such child LLEs. This change uses `LLE_SF()` macro which
packs family and flags in a single int field. This is done to simplify merging
back to stable/. Once this code lands, most of the cases will be converted to
use a dedicated `family` parameter.

Differential Revision: https://reviews.freebsd.org/D31379
MFC after: 2 weeks


# 663428ea 09-Aug-2021 Mark Johnston <markj@FreeBSD.org>

nd6: Mark several callouts as MPSAFE

The use of Giant here is vestigial and does not provide any useful
synchronization. Furthermore, non-MPSAFE callouts can cause the
softclock threads to block waiting for long-running newbus operations to
complete.

Reported by: mav
Reviewed by: bz
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31470


# 0b79b007 06-Aug-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

[lltable] Restructure nd6 code.

Factor out lltable locking logic from lltable_try_set_entry_addr()
into a separate lltable_acquire_wlock(), so the latter can be used
in other parts of the code w/o duplication.

Create nd6_try_set_entry_addr() to avoid code duplication in nd6.c
and nd6_nbr.c.

Move lle creation logic from nd6_resolve_slow() into a separate
nd6_get_llentry() to simplify the former.

These changes serve as a pre-requisite for implementing
RFC8950 (IPv4 prefixes with IPv6 nexthops).

Differential Revision: https://reviews.freebsd.org/D31432
MFC after: 2 weeks


# 8482aa77 02-Aug-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Use lltable calculated header when sending lle holdchain after successful lle resolution.

Subscribers: imp, ae, bz

Differential Revision: https://reviews.freebsd.org/D31391


# f3a3b061 02-Aug-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

[lltable] Unify datapath feedback mechanism.

Use the newly-created llentry_request_feedback(),
llentry_mark_used() and llentry_get_hittime() to
request a datapath usage check and fetch the results
in the same fashion in both IPv4 and IPv6.

While here, simplify llentry_provide_feedback() wrapper
by eliminating 1 condition check.

MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31390


# f187d6df 15-Mar-2021 Kyle Evans <kevans@FreeBSD.org>

base: remove if_wg(4) and associated utilities, manpage

After lengthy discussions, we've decided that the if_wg(4) driver and
related work is not yet ready to live in the tree. This driver has
larger security implications than many, and thus will be held to
more scrutiny than other drivers.

Please also see the related message sent to the freebsd-hackers@
and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on
2021/03/16, with the subject line "Removing WireGuard Support From Base"
for additional context.


# 74ae3f3e 14-Mar-2021 Kyle Evans <kevans@FreeBSD.org>

if_wg: import latest fixup work from the wireguard-freebsd project

This is the culmination of about a week of work from three developers to
fix a number of functional and security issues. This patch consists of
work done by the following folks:

- Jason A. Donenfeld <Jason@zx2c4.com>
- Matt Dunwoodie <ncon@noconroy.net>
- Kyle Evans <kevans@FreeBSD.org>

Notable changes include:
- Packets are now correctly staged for processing once the handshake has
completed, resulting in less packet loss in the interim.
- Various race conditions have been resolved, particularly w.r.t. socket
and packet lifetime (panics)
- Various tests have been added to assure correct functionality and
tooling conformance
- Many security issues have been addressed
- if_wg now maintains jail-friendly semantics: sockets are created in
the interface's home vnet so that it can act as the sole network
connection for a jail
- if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0
- if_wg now exports via ioctl a format that is future proof and
complete. It is additionally supported by the upstream
wireguard-tools (which we plan to merge in to base soon)
- if_wg now conforms to the WireGuard protocol and is more closely
aligned with security auditing guidelines

Note that the driver has been rebased away from using iflib. iflib
poses a number of challenges for a cloned device trying to operate in a
vnet that are non-trivial to solve and adds complexity to the
implementation for little gain.

The crypto implementation that was previously added to the tree was a
super complex integration of what previously appeared in an old out of
tree Linux module, which has been reduced to crypto.c containing simple
boring reference implementations. This is part of a near-to-mid term
goal to work with FreeBSD kernel crypto folks and take advantage of or
improve accelerated crypto already offered elsewhere.

There's additional test suite effort underway out-of-tree taking
advantage of the aforementioned jail-friendly semantics to test a number
of real-world topologies, based on netns.sh.

Also note that this is still a work in progress; work going further will
be much smaller in nature.

MFC after: 1 month (maybe)


# c139b3c1 22-Feb-2021 Kristof Provost <kp@FreeBSD.org>

arp/nd: Cope with late calls to iflladdr_event

When tearing down vnet jails we can move an if_bridge out (as
part of the normal vnet_if_return()). This can, when it's clearing out
its list of member interfaces, change its link layer address.
That sends an iflladdr_event, but at that point we've already freed the
AF_INET/AF_INET6 if_afdata pointers.

In other words: when the iflladdr_event callbacks fire we can't assume
that ifp->if_afdata[AF_INET] will be set.
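
The defensive pattern described above can be sketched as follows. This is
a simplified userland model under stated assumptions: struct ex_ifnet,
EX_AF_INET6 and ex_lladdr_event() are hypothetical names, and the real
callbacks do considerably more work once the check passes.

```c
#include <assert.h>
#include <stddef.h>

#define	EX_AF_MAX	4	/* hypothetical table size */
#define	EX_AF_INET6	2	/* hypothetical index, not the real AF_INET6 */

/* Hypothetical stand-in for struct ifnet with per-address-family state. */
struct ex_ifnet {
	void *if_afdata[EX_AF_MAX];
	int lladdr_updates;
};

/*
 * Link-layer address change callback.  During teardown the per-family
 * state may already be freed, so bail out instead of dereferencing it.
 */
static int
ex_lladdr_event(struct ex_ifnet *ifp)
{
	assert(ifp != NULL);
	if (ifp->if_afdata[EX_AF_INET6] == NULL)
		return (0);	/* late call: nothing left to update */
	ifp->lladdr_updates++;
	return (1);
}
```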

Reviewed by: donner@, melifaro@
MFC after: 1 week
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D28860


# 24a8f6d3 27-Jan-2021 Randall Stewart <rrs@FreeBSD.org>

When we are about to send down to the driver layer
we need to make sure that the m_nextpkt field is NULL
else the lower layers may do unwanted things.

Reviewed By: gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D28377


# 0da3f8c9 11-Jan-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Bump the number of queued packets for unresolved ARP/NDP entries to 16.

Currently the default behaviour is to keep only 1 packet per unresolved entry.
The ability to queue more than one packet was added 10 years ago, in r215207,
though the default value was kept intact.

Things have changed since that time. Systems tend to initiate multiple
connections at once for a variety of reasons.
For example, the recent kern/252278 bug report describes happy-eyeball DNS
behaviour sending multiple requests to the DNS server.

The primary driver for upper value for the queue length determination is
memory consumption. Remote actors should not be able to easily exhaust
local memory by sending packets to unresolved arp/ND entries.

For now, bump value to 16 packets, to match Darwin implementation.

The proper approach would be to switch the limit to calculate memory
consumption instead of packet count and limit based on memory.

We should MFC this with a variation of D22447.

Reviewers: #manpages, #network, bz, emaste

Reviewed By: emaste, gbe(doc), jilles(doc)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D28068


# d1d941c5 29-Nov-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove RADIX_MPATH config option.

ROUTE_MPATH is the new config option controlling new multipath routing
implementation. Remove the last pieces of RADIX_MPATH-related code and
the config option.

Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D27244


# dd4d5a5f 25-Nov-2020 Bjoern A. Zeeb <bz@FreeBSD.org>

IPv6: set ifdisabled in the kernel rather than in rc

Enable ND6_IFF_IFDISABLED when the interface is created in the
kernel, before returning to user space.

This avoids a race when an interface is created by a program which
also calls ifconfig IF inet6 -ifdisabled and races with the
devd -> /etc/pccard_ether -> .. netif start IF -> ifdisabled
calls (the devd/rc framework disabling IPv6 again after the program
had enabled it already).

In case the global net.inet6.ip6.accept_rtadv was turned on,
we also default to enabling IPv6 on the interfaces, rather than
disabling them.

PR: 248172
Reported by: Gert Doering (gert greenie.muc.de)
Reviewed by: glebius (, phk)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D27324


# fedeb08b 03-Oct-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Introduce scalable route multipath.

This change is based on the nexthop objects landed in D24232.

The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
relative weights and a dataplane-optimized structure to enable
efficient nexthop selection.

Similar to the nexthops, nexthop groups are immutable. Dataplane part
gets compiled during group creation and is basically an array of
nexthop pointers, compiled w.r.t their weights.
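
One way to picture "an array of nexthop pointers compiled w.r.t. their
weights" is the sketch below. It is an assumption-laden illustration, not
the kernel algorithm: the types, the EX_MAX_SLOTS cap, and the choice to
reduce weights by their greatest common divisor are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	EX_MAX_SLOTS	64	/* hypothetical cap for this sketch only */

struct ex_nhop {
	int idx;
	uint32_t weight;
};

/* Immutable once compiled: selection is a single array lookup. */
struct ex_nhgrp {
	const struct ex_nhop *slots[EX_MAX_SLOTS];
	size_t nslots;
};

static uint32_t
ex_gcd(uint32_t a, uint32_t b)
{
	while (b != 0) {
		uint32_t t = a % b;
		a = b;
		b = t;
	}
	return (a);
}

/* Give each nexthop weight/gcd consecutive slots; returns -1 on bad input. */
static int
ex_nhgrp_compile(struct ex_nhgrp *g, const struct ex_nhop *nh, size_t n)
{
	uint32_t d = 0;
	size_t i, total = 0;

	if (n == 0)
		return (-1);
	for (i = 0; i < n; i++) {
		if (nh[i].weight == 0)
			return (-1);
		d = ex_gcd(d, nh[i].weight);
	}
	for (i = 0; i < n; i++)
		total += nh[i].weight / d;
	if (total > EX_MAX_SLOTS)
		return (-1);
	g->nslots = 0;
	for (i = 0; i < n; i++)
		for (uint32_t k = 0; k < nh[i].weight / d; k++)
			g->slots[g->nslots++] = &nh[i];
	return (0);
}

/* Pick a nexthop for a flow: flows spread over slots in weight proportion. */
static const struct ex_nhop *
ex_nhgrp_select(const struct ex_nhgrp *g, uint32_t flowid)
{
	assert(g->nslots > 0);
	return (g->slots[flowid % g->nslots]);
}
```

With weights 10 and 20 this sketch produces one slot for the first nexthop
and two for the second, so roughly a third of the flow IDs map to the
lighter nexthop.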

With this change, `rt_nhop` field of `struct rtentry` contains either
nexthop or nexthop group. They are distinguished by the presence of
NHF_MULTIPATH flag.
All dataplane lookup functions return a pointer to the nexthop object,
leaving nexthop group details inside the routing subsystem.

User-visible changes:

The change is intended to be backward-compatible: all non-mpath operations
should work as before with ROUTE_MPATH and net.route.multipath=1.

All routes now come with a weight; the default weight is 1, the maximum is 2^24-1.

Current maximum multipath group width is statically set to 64.
This will become sysctl-tunable in the followup changes.

Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1

route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20

netstat -6On

Nexthop groups data

Internet6:
GrpIdx NhIdx   Weight  Slots   Gateway                                 Netif     Refcnt
1      ------- ------- ------- --------------------------------------- --------- 1
       13      10      1       2001:db8::2                             vlan2
       14      20      2       2001:db8::3                             vlan2

Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default

Tested by: olivier
Reviewed by: glebius
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D26449


# 2259a030 21-Sep-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Rework part of routing code to reduce difference to D26449.

* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(),
as these two can be applied at different entities and at different times.
* Start filling route weight in route change notifications
* Pass flowid to UDP/raw IP route lookups
* Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact
that rtentry can contain multiple nexthops.

Differential Revision: https://reviews.freebsd.org/D26497


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# e1c05fd2 21-Jul-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Transition from rtrequest1_fib() to rib_action().

Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
in6_rtrequest, rtrequest_fib> and their uses and switch
to rib_action(). This is part of the new routing KPI.

Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546


# 72587123 19-Jul-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Temporarily revert r363319 to unbreak the build.

Reported by: CI
Pointy hat to: melifaro


# 8cee15d9 19-Jul-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Transition from rtrequest1_fib() to rib_action().

Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
in6_rtrequest, rtrequest_fib> and their uses and switch
to rib_action(). This is part of the new routing KPI.

Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546


# 4c7ba83f 12-Jul-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Switch inet6 default route subscription to the new rib subscription api.

Old subscription model allowed only single customer.

Switch inet6 to the new subscription api and eliminate the old model.

Differential Revision: https://reviews.freebsd.org/D25615


# 2bbab0af 23-May-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Use epoch(9) for rtentries to simplify control plane operations.

Currently the only reason for refcounting rtentries is the need to report
the rtable operation details immediately after the execution.
Delaying rtentry reclamation allows us to stop refcounting and simplify the code.
Additionally, this change allows reimplementing rib_lookup_info(), which
is used by some of the customers to get the matching prefix along
with nexthops, in a more efficient way.

The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to
nhop_priv to be able to reliably set curvnet even during vnet teardown.
The rest of the reference counting code will be removed in D24867.

Differential Revision: https://reviews.freebsd.org/D24866


# bc74b819 12-May-2020 Andrew Gallatin <gallatin@FreeBSD.org>

IPv6: Fix a panic in the nd6 code with unmapped mbufs.

If the neighbor entry for an IPv6 TCP session using unmapped
mbufs times out, IPv6 will send an icmp6 dest. unreachable
message. In doing this, it will try to do a software checksum
on the reflected packet. If this is a TCP session using unmapped
mbufs, then there will be a kernel panic.

To fix this, just free packets with unmapped mbufs, rather
than sending the icmp.

Reviewed by: np, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24821


# 74787ef4 29-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add nhop to the ifa_rtrequest() callback.

With the upcoming multipath changes described in D24141,
rt->rt_nhop can potentially point to a nexthop group instead of
an individual nhop.
To simplify caller handling of such cases, change ifa_rtrequest() callback
to pass changed nhop directly.

Differential Revision: https://reviews.freebsd.org/D24604


# e7d8af4f 28-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Move route_temporal.c and route_var.h to net/route.

Nexthop objects implementation, defined in r359823,
introduced sys/net/route directory intended to hold all
routing-related code. Move recently-introduced route_temporal.c and
private route_var.h header there.

Differential Revision: https://reviews.freebsd.org/D24597


# fe6da727 28-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Move struct rtentry definition to nhop_var.h.

One of the goals of the new routing KPI defined in r359823
is to entirely hide `struct rtentry` from the consumers.
This will allow improving routing subsystem internals and delivering
features much faster.

This is one of the last changes, effectively moving struct rtentry
definition to a net/route_var.h header, internal to the routing subsystem.

Differential Revision: https://reviews.freebsd.org/D24580


# aaad3c4f 23-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert rtentry field accesses into nhop field accesses.

One of the goals of the new routing KPI defined in r359823 is to entirely
hide `struct rtentry` from the consumers. This will allow improving routing
subsystem internals and delivering more features much faster.

This commit is a mostly mechanical change to eliminate direct struct rtentry
field accesses.

The only notable difference is AF_LINK gateway encoding.

AF_LINK gw is used in routing stack for operations with interface routes
and host loopback routes.
In the former case it indicates _some_ non-NULL gateway, as the interface
is the same as in rt_ifp in kernel and rtm_ifindex in rtsock reporting.
In the latter case the interface index inside gateway was used by the IPv6
datapath to verify address scope for link-local interfaces.

Kernel uses struct sockaddr_dl for this type of gateway. This structure
allows for specifying rich interface data, such as mac address and interface
name. However, this results in relatively large structure size - 52 bytes.
Routing stack fills in only 2 fields - sdl_index and sdl_type, which reside
in the first 8 bytes of the structure.

In the new KPI, struct nhop_object tries to be cache-efficient, hence
embeds the gateway address inside the structure. In the AF_LINK case it
stores a shortened version of the structure - struct sockaddr_dl_short,
which occupies 16 bytes. After D24340 changes, the data inside AF_LINK
gateway will not be used in the kernel at all, leaving rtsock as the only
potential concern.

The difference in rtsock reporting:

(old)
got message of size 240 on Thu Apr 16 03:12:13 2020
RTM_ADD: Add Route: len 240, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 link#5 255.255.255.0

(new)
got message of size 200 on Sun Apr 19 09:46:32 2020
RTM_ADD: Add Route: len 200, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 link#5 255.255.255.0

Note the 40-byte difference (52-16 + alignment).
However, gateway is still a valid AF_LINK gateway with proper data filled in.

It is worth noting that these particular messages (interface routes) are mostly
ignored by routing daemons:
* bird/quagga/frr uses RTM_NEWADDR and ignores prefix route addition messages.
* quagga/frr ignores routes without gateway

A more detailed overview of how rtsock messages are used by the
routing daemons to reconstruct the kernel view can be found in D22974.

Differential Revision: https://reviews.freebsd.org/D24519


# 539642a2 16-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add nhop parameter to rti_filter callback.

One of the goals of the new routing KPI defined in r359823 is to
entirely hide `struct rtentry` from the consumers. This will allow
improving routing subsystem internals and delivering more features much faster.
This change is one of the ongoing changes to eliminate direct
struct rtentry field accesses.

Additionally, with the followup multipath changes, single rtentry can point
to multiple nexthops.

With that in mind, convert rti_filter callback used when traversing the
routing table to accept pair (rt, nhop) instead of nexthop.

Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24440


# 3c5018ca 19-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

nd6: sysctl

Move the SYSCTL_DECL to the top of the file. Move the sysctl function
before SYSCTL_PROC so that we don't need an extra function declaration in
the middle of the file.

No functional changes.

MFC after: 3 weeks
Sponsored by: Netflix


# 6db65273 19-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

nd6: make nd6_timer_ch static

nd6_timer_ch is only used in file local context. There is no need to
export it, so make it static.

MFC after: 3 weeks
Sponsored by: Netflix


# 808c432f 15-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

nd6: retire defrouter_select(), use _fib() variant.

Burn bridges and replace the last two calls of defrouter_select() with
defrouter_select_fib(). That allows us to retire defrouter_select()
and make it more clear in the calling code that it applies to all FIBs.

Sponsored by: Netflix


# e20b5bc4 15-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

nd6: simplify code

We are taking the same actions in both cases of the branch inside the block.
Simplify that code as the extra branch is not needed.

MFC after: 3 weeks
Sponsored by: Netflix


# d64df9a2 13-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

nd6: make nd6_alloc() file static

nd6_alloc() is a function used only locally. Make it static and no
longer export it. Keeps the KPI smaller.

Sponsored by: Netflix


# ad675b32 12-Nov-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

nd6 defrouter: consolidate nd_defrouter manipulations in nd6_rtr.c

Move the nd_defrouter along with the sysctl handler from nd6.c to
nd6_rtr.c and make the variable file static. Provide (temporary)
new accessor functions for code manipulating nd_defrouter from nd6.c,
and stop exporting functions no longer needed outside nd6_rtr.c.
This also shuffles a few functions around in nd6_rtr.c without
functional changes.

Given all nd_defrouter logic is now in one place we can tidy up the
code, locking, and other open items.

MFC after: 3 weeks
X-MFC: keep exporting the functions
Sponsored by: Netflix


# d6dbfed8 04-Nov-2019 Gleb Smirnoff <glebius@FreeBSD.org>

In nd6_timer() enter the network epoch earlier. The defrouter_del() may
call into leaf functions that require epoch. Since the function is already
run in non-sleepable context, it should be safe to cover it whole with epoch.

Reported by: syzkaller


# ef2e580e 12-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Don't cover in6_ifattach() with network epoch, as it may call into
network drivers ioctls, that may sleep.

PR: 241223


# b8a6e03f 07-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Widen NET_EPOCH coverage.

When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111


# e2e050c8 19-May-2019 Conrad Meyer <cem@FreeBSD.org>

Extract eventfilter declarations to sys/_eventfilter.h

This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.


# ca1163bd 30-Mar-2019 Mark Johnston <markj@FreeBSD.org>

Do not perform DAD on stf(4) interfaces.

stf(4) interfaces are not multicast-capable so they can't perform DAD.
They also did not set IFF_DRV_RUNNING when an address was assigned, so
the logic in nd6_timer() would periodically flag such an address as
tentative, resulting in interface flapping.

Fix the problem by setting IFF_DRV_RUNNING when an address is assigned,
and do some related cleanup:
- In in6if_do_dad(), remove a redundant check for !UP || !RUNNING.
There is only one caller in the tree, and it only looks at whether
the return value is non-zero.
- Have in6if_do_dad() return false if the interface is not
multicast-capable.
- Set ND6_IFF_NO_DAD when an address is assigned to an stf(4) interface
and the interface goes UP as a result. Note that this is not
sufficient to fix the problem because the new address is marked as
tentative and DAD is started before in6_ifattach() is called.
However, setting no_dad is formally correct.
- Change nd6_timer() to not flag addresses as tentative if no_dad is
set.

This is based on a patch from Viktor Dukhovni.

Reported by: Viktor Dukhovni <ietf-dane@dukhovni.org>
Reviewed by: ae
MFC after: 3 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19751


# 30b45077 07-Mar-2019 Bjoern A. Zeeb <bz@FreeBSD.org>

Update for IETF draft-ietf-6man-ipv6only-flag.

When we roam between networks and our link-state goes down, automatically remove
the IPv6-Only flag from the interface. Otherwise we might switch from an
IPv6-only to an IPv4-only network and the flag would stay and we would prevent
IPv4 from working.

While the actual function call to clear the flag is under EXPERIMENTAL,
the eventhandler is not as we might want to re-use it for other
functionality on link-down event (such as re-calculating default routers,
for example, if there is more than one).

Reviewed by: hrs
Differential Revision: https://reviews.freebsd.org/D19487


# a68cc388 08-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanical cleanup of epoch(9) usage in network stack.

- Remove macros that covertly create epoch_tracker on thread stack. Such
macros are quite unsafe, e.g. will produce buggy code if the same macro is
used in embedded scopes. Explicitly declare epoch_tracker always.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.

This is a non-functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.
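
To illustrate the first point, here is a minimal userland sketch of the
explicit-tracker style. All names (demo_tracker, demo_enter, demo_exit) are
hypothetical stand-ins invented for this example and are not the kernel's
epoch(9) API; the counters exist only to make the nesting observable.

```c
/* Hypothetical stand-ins for illustration; not the kernel's epoch(9) API. */
struct demo_tracker {
	int depth;		/* nesting depth recorded at enter time */
};

static int demo_depth;		/* sections currently entered */
static int demo_max_depth;	/* deepest nesting observed so far */

static void
demo_enter(struct demo_tracker *et)
{
	et->depth = ++demo_depth;
	if (demo_depth > demo_max_depth)
		demo_max_depth = demo_depth;
}

static void
demo_exit(struct demo_tracker *et)
{
	(void)et;
	demo_depth--;
}

/*
 * Explicit style: each scope names its own tracker, so the reader can see
 * which enter pairs with which exit.  A macro that hid the declaration
 * would declare a same-named variable in both scopes, and the inner one
 * would silently shadow the outer one.
 */
int
demo_nested_sections(void)
{
	struct demo_tracker outer_et;

	demo_enter(&outer_et);
	{
		struct demo_tracker inner_et;

		demo_enter(&inner_et);	/* recursion is allowed */
		demo_exit(&inner_et);
	}
	demo_exit(&outer_et);
	return (demo_depth);	/* 0 when every enter was matched */
}
```

With the trackers spelled out, a mismatched enter/exit pair is visible in
review rather than hidden behind a macro expansion.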

Discussed with: jtl, gallatin


# 5f901c92 24-Jul-2018 Andrew Turner <andrew@FreeBSD.org>

Use the new VNET_DEFINE_STATIC macro when we are defining static VNET
variables.

Reviewed by: bz
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16147


# 4f6c66cc 23-May-2018 Matt Macy <mmacy@FreeBSD.org>

UDP: further performance improvements on tx

Cumulative throughput while running 64
netperf -H $DUT -t UDP_STREAM -- -m 1
on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps

Single stream throughput increases from 910kpps to 1.18Mpps

Baseline:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg

- Protect read access to global ifnet list with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg

- Protect short lived ifaddr references with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg

- Convert if_afdata read lock path to epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

A fix for the inpcbhash contention is pending sufficient time
on a canary at LLNW.

Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15409


# d7c5a620 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

ifnet: Replace if_addr_lock rwlock with epoch + mutex

Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host is receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps InGbs OGbs     err TCP Est  %CPU syscalls     csw  irq GBfree
  4.98  0.00  4.42 0.00 4235592      33 83.80  4720653 2149771 1235 247.32
  4.73  0.00  4.20 0.00 4025260      33 82.99  4724900 2139833 1204 247.32
  4.72  0.00  4.20 0.00 4035252      33 82.14  4719162 2132023 1264 247.32
  4.71  0.00  4.21 0.00 4073206      33 83.68  4744973 2123317 1347 247.32
  4.72  0.00  4.21 0.00 4061118      33 80.82  4713615 2188091 1490 247.32
  4.72  0.00  4.21 0.00 4051675      33 85.29  4727399 2109011 1205 247.32
  4.73  0.00  4.21 0.00 4039056      33 84.65  4724735 2102603 1053 247.32

After the patch

InMpps OMpps InGbs OGbs     err TCP Est  %CPU syscalls     csw  irq GBfree
  5.43  0.00  4.20 0.00 3313143      33 84.96  5434214 1900162 2656 245.51
  5.43  0.00  4.20 0.00 3308527      33 85.24  5439695 1809382 2521 245.51
  5.42  0.00  4.19 0.00 3316778      33 87.54  5416028 1805835 2256 245.51
  5.42  0.00  4.19 0.00 3317673      33 90.44  5426044 1763056 2332 245.51
  5.42  0.00  4.19 0.00 3314839      33 88.11  5435732 1792218 2499 245.52
  5.44  0.00  4.19 0.00 3293228      33 91.84  5426301 1668597 2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15366


# 3a4fc8a8 13-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Remove support for the Arcnet protocol.

While Arcnet has some continued deployment in industrial controls, the
lack of drivers for any of the PCI, USB, or PCIe NICs on the market
suggests such users aren't running FreeBSD.

Evidence in the PR database suggests that the cm(4) driver (our sole
Arcnet NIC) was broken in 5.0 and has not worked since.

PR: 182297
Reviewed by: jhibbits, vangyzen
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15057


# 0437c8e3 11-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Remove support for FDDI networks.

Defines in net/if_media.h remain in case code copied from ifconfig is in
use elsewhere (supporting a non-existent media type is harmless).

Reviewed by: kib, jhb
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15017


# 69f0fecb 28-Mar-2018 Brooks Davis <brooks@FreeBSD.org>

Remove infrastructure for token-ring networks.

Reviewed by: cem, imp, jhb, jmallett
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14875


# f8116f39 29-Jan-2018 Eric van Gyzen <vangyzen@FreeBSD.org>

ND6: Set the correct state for new neighbor cache entries

Restore state 6. Many of the UNH tests end up exercising this
state, where we have a new neighbor cache entry and a new link-layer
entry is being created for it. The link-layer address is currently
unknown so the initial state of the "llentry" should remain initialized
to ND6_LLINFO_NOSTATE so that the ND code will send a solicitation.
Setting this to ND6_LLINFO_STALE implies that the link-level entry
is valid and can be used (but needs to be refreshed via the Neighbor
Unreachability state machine).

https://forums.freebsd.org/threads/64287/

Submitted by: Farrell Woods <Farrell_Woods@Dell.com>
Reviewed by: mjoras, dab, ae
MFC after: 1 week
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D14059


# 151ba793 24-Dec-2017 Alexander Kabaev <kan@FreeBSD.org>

Do a pass removing some write-only variables from the kernel.

This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 559b4296 17-Mar-2017 Alan Somers <asomers@FreeBSD.org>

Constrain IPv6 routes to single FIBs when net.add_addr_allfibs=0

sys/netinet6/icmp6.c
Use the interface's FIB for source address selection in ICMPv6 error
responses.

sys/netinet6/in6.c
In in6_newaddrmsg, announce arrival of local addresses on the
interface's FIB only. In in6_lltable_rtcheck, use a per-fib ND6
cache instead of a single cache.

sys/netinet6/in6_src.c
In in6_selectsrc, use the caller's fib instead of the default fib.
In in6_selectsrc_socket, remove a superfluous check.

sys/netinet6/nd6.c
In nd6_lle_event, use the interface's fib for routing socket
messages. In nd6_is_new_addr_neighbor, check all FIBs when trying
to determine whether an address is a neighbor. Also, simplify the
code for point to point interfaces.

sys/netinet6/nd6.h
sys/netinet6/nd6.c
sys/netinet6/nd6_rtr.c
Make defrouter_select fib-aware, and make all of its callers pass in
the interface fib.

sys/netinet6/nd6_nbr.c
When inputting a Neighbor Solicitation packet, consider the
interface fib instead of the default fib for DAD. Output NS and
Neighbor Advertisement packets on the correct fib.

sys/netinet6/nd6_rtr.c
Allow installing the same host route on different interfaces in
different FIBs. If rt_add_addr_allfibs=0, only install or delete
the prefix route on the interface fib.

tests/sys/netinet/fibs_test.sh
Clear some expected failures, but add a skip for the newly revealed
BUG217871.

PR: 196361
Submitted by: Erick Turnquist <jhujhiti@adjectivism.org>
Reported by: Jason Healy <jhealy@logn.net>
Reviewed by: asomers
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D9451


# 2bbd06fc 28-Jan-2017 Andriy Voskoboinyk <avos@FreeBSD.org>

Garbage collect IFT_IEEE80211 (but leave the define for possible reuse)

This interface type ("a parent interface of wlanX") is not used since
r287197

Reviewed by: adrian, glebius
Differential Revision: https://reviews.freebsd.org/D9308


# 8cd3b204 08-Jan-2017 Mark Johnston <markj@FreeBSD.org>

Release the ND6 list lock before making a prefix off-link in nd6_timer().

Reported by: Jim <BM-2cWfdfG5CJsquqkJyry7hZT9LypbSEWEkQ@bitmessage.ch>
X-MFC With: r306829


# d748f7ef 07-Oct-2016 Mark Johnston <markj@FreeBSD.org>

Lock the ND prefix list and add refcounting for prefixes.

This change extends the nd6 lock to protect the ND prefix list as well
as the list of advertising routers associated with each prefix. To handle
cases where the nd6 lock must be dropped while iterating over either the
prefix or default router lists, a generation counter is used to track
modifications to the lists. Additionally, a new mutex is used to serialize
prefix on-link/off-link transitions. This mutex must be acquired before
the nd6 lock and is held while updating the routing table in
nd6_prefix_onlink() and nd6_prefix_offlink().
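
The generation-counter idea can be sketched in userland as follows. This is
an illustration under stated assumptions, not the nd6 code: guarded_list,
gl_append, gl_process and append_once are hypothetical names, and the "lock"
is a plain flag standing in for the nd6 lock. The point shown is that a walker
which drops the lock records the generation first and restarts the scan if
the list changed in the meantime.

```c
#include <stddef.h>

#define MAX_ITEMS 8

struct item {
	int id;
	int needs_work;		/* set while slow work is still pending */
};

struct guarded_list {
	int locked;		/* stand-in for the real lock */
	unsigned gen;		/* bumped on every list modification */
	struct item items[MAX_ITEMS];
	size_t count;
};

static void gl_lock(struct guarded_list *gl)   { gl->locked = 1; }
static void gl_unlock(struct guarded_list *gl) { gl->locked = 0; }

/* Append an item and bump the generation; caller holds the lock. */
int
gl_append(struct guarded_list *gl, int id, int needs_work)
{
	if (gl->count == MAX_ITEMS)
		return (-1);
	gl->items[gl->count].id = id;
	gl->items[gl->count].needs_work = needs_work;
	gl->count++;
	gl->gen++;
	return (0);
}

typedef void (*slow_fn)(struct guarded_list *gl, int id);

/*
 * Run slow_work for every pending item.  The lock is dropped around the
 * slow call; if the generation moved while it was dropped, the index may
 * be stale, so the scan restarts.  Returns the number of restarts.
 */
int
gl_process(struct guarded_list *gl, slow_fn slow_work)
{
	int restarts = 0;
	size_t i;

	gl_lock(gl);
restart:
	for (i = 0; i < gl->count; i++) {
		unsigned gen;
		int id;

		if (!gl->items[i].needs_work)
			continue;
		id = gl->items[i].id;
		gl->items[i].needs_work = 0;	/* guarantees progress */
		gen = gl->gen;
		gl_unlock(gl);
		slow_work(gl, id);		/* may modify the list */
		gl_lock(gl);
		if (gen != gl->gen) {
			restarts++;
			goto restart;
		}
	}
	gl_unlock(gl);
	return (restarts);
}

/* Example slow path: appends one extra item the first time it runs. */
static int appended;

void
append_once(struct guarded_list *gl, int id)
{
	(void)id;
	if (appended)
		return;
	appended = 1;
	gl_lock(gl);
	gl_append(gl, 100, 0);
	gl_unlock(gl);
}
```

Clearing needs_work before dropping the lock is what keeps the restart loop
from revisiting the same item forever; the real code achieves progress by
other means, which this sketch does not attempt to reproduce.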

Reviewed by: ae, tuexen (SCTP bits)
Tested by: Jason Wolfe <jason@llnw.com>,
Larry Rosenman <ler@lerctr.org>
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D8125


# f7d91d8c 06-Oct-2016 Mark Johnston <markj@FreeBSD.org>

Use a const reference to prefixes in nd6_is_new_addr_neighbor().

MFC after: 1 week


# 0bbf244e 23-Sep-2016 Mark Johnston <markj@FreeBSD.org>

Rename ndpr_refcnt to ndpr_addrcnt.

This field counts derived addresses and is not a true refcount for prefix
objects, so the previous name was misleading.

MFC after: 1 week


# ea17754c 21-Jul-2016 Mike Karels <karels@FreeBSD.org>

Fix per-connection L2 caching in fast path

r301217 re-added per-connection L2 caching from a previous change,
but it omitted caching in the fast path. Add it.

Reviewed By: gallatin
Approved by: gnn (mentor)
Differential Revision: https://reviews.freebsd.org/D7239


# 89856f7e 21-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Get closer to a VIMAGE network stack teardown from top to bottom rather
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.

Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.

Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.

For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along with (or before) the higher layer protocol being shut down.

Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.

For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).

Provide or enhance helper functions to do proper cleanup at a protocol
rather than at an interface level.

Approved by: re (hrs)
Obtained from: projects/vnet
Reviewed by: gnn, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6747


# 99a0c406 06-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Move the callout_reset() to the end of the work not having it stick
before we do anything.

Obtained from: projects/vnet
MFC after: 2 week
Sponsored by: The FreeBSD Foundation


# 6d768226 02-Jun-2016 George V. Neville-Neil <gnn@FreeBSD.org>

This change re-adds L2 caching for TCP and UDP, as originally added in D4306
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.

Submitted by: Mike Karels
Differential Revision: https://reviews.freebsd.org/D6262


# 65fdf521 23-May-2016 Mark Johnston <markj@FreeBSD.org>

Mark the prefix and default router list sysctl handlers MPSAFE.

MFC after: 2 weeks


# cc51be7b 23-May-2016 Mark Johnston <markj@FreeBSD.org>

Acquire the nd6 lock in the prefix list sysctl handler.

The nd6 lock will be used to synchronize access to the NDP prefix list.

MFC after: 2 weeks
Tested by: Jason Wolfe (as part of a larger change)


# 5e0a6f31 19-May-2016 Mark Johnston <markj@FreeBSD.org>

Move IPv6 malloc tag definitions into the IPv6 code.


# df890b8e 09-May-2016 Mark Johnston <markj@FreeBSD.org>

Remove obsolescent comments from nd6_purge().

MFC after: 1 week


# a4641f4e 03-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/net*: minor spelling fixes.

No functional change.


# b0ab3725 25-Apr-2016 Luiz Otavio O Souza <loos@FreeBSD.org>

Fixes the comment to reflect the code.

Sponsored by: Rubicon Communications (Netgate)


# 435bece4 29-Mar-2016 Mark Johnston <markj@FreeBSD.org>

Modify nd6_llinfo_timer() to acquire the nd6 lock before the LLE lock.

When expiring a neighbour cache entry we may need to look up the associated
default router, which requires the nd6 read lock. To avoid an LOR, the nd6
lock should be acquired first.

X-MFC-With: r296063
Tested by: Larry Rosenman <ler@lerctr.org> (previous revision)


# 9901091e 22-Mar-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Mfp4 @180378:

Factor out nd6 and in6_attach initialization to their own files.
Also move destruction into those files though still called from
the central initialization.

Sponsored by: CK Software GmbH
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D5033


# 4de485fe 25-Feb-2016 Mark Johnston <markj@FreeBSD.org>

Lock the NDP default router list and count defrouter references.

This addresses a number of race conditions that can cause crashes as a
result of unsynchronized access to the list.

PR: 206904
Tested by: Larry Rosenman <ler@lerctr.org>,
Kevin Bowling <kevin.bowling@kev009.com>
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D5315


# 01869be5 12-Feb-2016 Mark Johnston <markj@FreeBSD.org>

Rename the flags field of struct nd_defrouter to "raflags".

This field contains the flags inherited from the corresponding router
advertisement message and is not for storing private state.

MFC after: 1 week


# baebd3e5 12-Feb-2016 Mark Johnston <markj@FreeBSD.org>

Remove superfluous return statements from the neighbour discovery code.

MFC after: 1 week


# fc315641 12-Feb-2016 Mark Johnston <markj@FreeBSD.org>

Fix style around allocations from M_IP6NDP.

- Don't cast the return value of malloc(9).
- Use M_ZERO instead of explicitly calling bzero(9).

MFC after: 1 week


# 97dca6a2 12-Feb-2016 Mark Johnston <markj@FreeBSD.org>

Remove some unreferenced NDP debug variable definitions.

MFC after: 1 week


# 1f12da0e 22-Jan-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Just checkpoint the WIP in order to be able to make the tree update
easier. Note: this is currently not in a usable state as certain
teardown parts are not called and the DOMAIN rework is missing.
More to come soon and find its way to head.

Obtained from: P4 //depot/user/bz/vimage/...
Sponsored by: The FreeBSD Foundation


# 9a1b64d5 04-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add rib_lookup_info() to provide API for retrieving individual route
entries data in unified format.

There are control plane functions that require information other than
just next-hop data (e.g. individual rtentry fields like flags or
prefix/mask). Given that the goal is to avoid rte reference/refcounting,
re-use rt_addrinfo structure to store most rte fields. If caller wants
to retrieve key/mask or gateway (which are sockaddrs and are allocated
separately), it needs to provide sufficient-sized sockaddrs structures
w/ their pointers saved in the passed rt_addrinfo.

Convert:
* lltable new records checks (in_lltable_rtcheck(),
nd6_is_new_addr_neighbor()).
* rtsock pre-add/change route check.
* IPv6 NS ND-proxy check (RADIX_MPATH code was eliminated because
1) we don't support RTF_ANNOUNCE ND-proxy for networks and there should
not be multiple host routes for such hosts 2) if we have multiple
routes we should inspect them (which is not done). 3) the entire idea
of abusing KRT as storage for ND proxy seems odd. Userland programs
should be used for that purpose).


# 4fb3a820 30-Dec-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Implement interface link header precomputation API.

Add if_requestencap() interface method which is capable of calculating
various link headers for given interface. Right now there is support
for INET/INET6/ARP llheader calculation (IFENCAP_LL type request).
Other types are planned to support more complex calculation
(L2 multipath lagg nexthops, tunnel encap nexthops, etc..).

Reshape 'struct route' to be able to pass additional data (with its length)
to prepend to mbuf.

These two changes permit routing code to pass pre-calculated nexthop data
(like L2 header for route w/gateway) down to the stack eliminating the
need for other lookups. It also brings us closer to more complex scenarios
like transparently handling MPLS nexthops and tunnel interfaces.
Last, but not least, it removes layering violation introduced by flowtable
code (ro_lle) and simplifies handling of existing if_output consumers.

ARP/ND changes:
Make arp/ndp stack pre-calculate link header upon installing/updating lle
record. Interface link address change are handled by re-calculating
headers for all lles based on if_lladdr event. After these changes,
arpresolve()/nd6_resolve() returns full pre-calculated header for
supported interfaces thus simplifying if_output().
Move these lookups to separate ether_resolve_addr() function which either
returns an error or a fully-prepared link header. Add <arp|nd6_>resolve_addr()
compat versions to return link addresses instead of pre-calculated data.

BPF changes:
Raw bpf writes occupied _two_ cases: AF_UNSPEC and pseudo_AF_HDRCMPLT.
Despite the naming, both of these have their header "complete". The only
difference is that interface source mac has to be filled by OS for
AF_UNSPEC (controlled via BIOCGHDRCMPLT). This logic has to stay inside
BPF and not pollute if_output() routines. Convert BPF to pass prepend data
via new 'struct route' mechanism. Note that it does not change
non-optimized if_output(): ro_prepend handling is purely optional.
Side note: hackish pseudo_AF_HDRCMPLT is supported for ethernet and FDDI.
It is not needed for ethernet anymore. The only remaining FDDI user is
dev/pdq mostly untouched since 2007. FDDI support was eliminated from
OpenBSD in 2013 (sys/net/if_fddisubr.c rev 1.65).

Flowtable changes:
Flowtable violates layering by saving (and not correctly managing)
rtes/lles. Instead of passing lle pointer, pass pointer to pre-calculated
header data from that lle.

Differential Revision: https://reviews.freebsd.org/D4102


# d6e82913 17-Dec-2015 Steven Hartland <smh@FreeBSD.org>

Revert r292275 & r292379

glebius has concerns about these changes so reverting those can be discussed
and addressed.

Sponsored by: Multiplay


# 3a909afe 16-Dec-2015 Steven Hartland <smh@FreeBSD.org>

Fix issues introduced by r292275

* Fix panic for etherswitches which don't have a LLADDR.
* Disabled DELAY in unsolicited NDA, which needs further work.
* Fixed missing DELAY in carp_send_na.
* style(9) fix.

Reported by: kp & melifaro
X-MFC-With: r292275
MFC after: 1 month
Sponsored by: Multiplay


# 427c2f4e 16-Dec-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Provide additional lle data in IPv6 lltable dump used by ndp(8).

Before the change, things like lle state were queried via
SIOCGNBRINFO_IN6 by ndp(8) for _each_ lle entry in dump.
This ioctl was added in 1999, probably to avoid touching rtsock code.

This change maps SIOCGNBRINFO_IN6 data to standard rtsock dump the
following way:
expire (already) maps to rtm_rmx.rmx_expire
isrouter -> rtm_flags & RTF_GATEWAY
asked -> rtm_rmx.rmx_pksent
state -> rtm_rmx.rmx_state (maps to rmx_weight via define)
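
The mapping above can be pictured with a small conversion routine. This is a
sketch only: demo_nbrinfo, demo_rtview and DEMO_RTF_GATEWAY are hypothetical
stand-ins invented here so the example compiles on its own, and the field
types are assumptions rather than the kernel's definitions.

```c
#define DEMO_RTF_GATEWAY 0x2	/* assumed flag bit for this sketch */

/* Hypothetical view of the per-neighbour data the ioctl used to return. */
struct demo_nbrinfo {
	long expire;
	int isrouter;
	int asked;
	int state;
};

/* Hypothetical view of the rtsock fields that now carry the same data. */
struct demo_rtview {
	long rmx_expire;
	int rtm_flags;
	unsigned long rmx_pksent;
	unsigned long rmx_state;
};

/* Apply the four-line mapping from the commit message, field by field. */
struct demo_rtview
demo_map_nbrinfo(const struct demo_nbrinfo *nbi)
{
	struct demo_rtview rv;

	rv.rmx_expire = nbi->expire;			/* expire   */
	rv.rtm_flags = nbi->isrouter ? DEMO_RTF_GATEWAY : 0; /* isrouter */
	rv.rmx_pksent = (unsigned long)nbi->asked;	/* asked    */
	rv.rmx_state = (unsigned long)nbi->state;	/* state    */
	return (rv);
}
```

A consumer reading the dump would then test the gateway bit instead of
issuing one ioctl per entry.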

Reviewed by: ae


# 52e53e2d 15-Dec-2015 Steven Hartland <smh@FreeBSD.org>

Fix lagg failover due to missing notifications

When using lagg failover mode neither Gratuitous ARP (IPv4) nor Unsolicited
Neighbour Advertisements (IPv6) are sent to notify other nodes that the
address may have moved.

This results in slow failover, dropped packets and network outages for the
lagg interface when the primary link goes down.

We now use the new if_link_state_change_cond with the force param set to
allow lagg to force through link state changes and hence fire an
ifnet_link_event, which is now monitored by rip and nd6.

Upon receiving these events each protocol triggers the relevant
notifications:
* inet4 => Gratuitous ARP
* inet6 => Unsolicited Neighbour Announce

This also fixes the carp IPv6 NA's that stopped working after r251584 which
added the ipv6_route__llma route.

The new behaviour can be controlled using the sysctls:
* net.link.ether.inet.arp_on_link
* net.inet6.icmp6.nd6_on_link

Also removed unused param from lagg_port_state and added descriptions for the
sysctls while here.

PR: 156226
MFC after: 1 month
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D4111


# 7e037c12 14-Dec-2015 Kristof Provost <kp@FreeBSD.org>

inet6: Do not assume every interface has ip6 enabled.

Certain interfaces (e.g. pfsync0) do not have ip6 addresses (in other words,
ifp->if_afdata[AF_INET6] is NULL). Ensure we don't panic when the MTU is
updated.

pfsync interfaces will never have ip6 support, because it's explicitly disabled
in in6_domifattach().

PR: 205194
Reviewed by: melifaro, hrs
Differential Revision: https://reviews.freebsd.org/D4522


# 12cb7521 13-Dec-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove LLE read lock from IPv6 fast path.

LLE structure is mostly unchanged during its lifecycle: there are only 2
things relevant for fast path lookup code:
1) link-level address change. Since r286722, these updates are performed
under AFDATA WLOCK.
2) Some sort of feedback indicating that this particular entry is used so
we send NS to perform reachability verification instead of expiring entry.
The only signal that is needed from fast path is something like binary
yes/no.
The latter is solved by the following changes:

Special r_skip_req (introduced in D3688) value is used for fast path feedback.
It is read lockless by fast path, but updated under req_mutex mutex. If this
field is non-zero, then fast path will acquire lock and set it back to 0.

After transitioning to STALE state, callout timer is armed to run each
V_nd6_delay seconds to make sure that if packet was transmitted at the start
of given interval, we would be able to switch to PROBE state in V_nd6_delay
seconds as user expects.
(in STALE state) timer is rescheduled until original V_nd6_gctimer expires
keeping lle in STALE state (remaining timer value stored in lle_remtime).
(in STALE state) timer is rescheduled if packet was transmitted less than
V_nd6_delay seconds ago to make sure we transition to PROBE state exactly
after V_nd6_delay seconds.

As a result, all packets towards lle in REACHABLE/STALE/PROBE states are handled
by fast path without acquiring lle read lock.
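
The feedback flag described above might look roughly like this in a userland
sketch. The names (demo_lle, demo_fast_path_use, demo_timer_check) are
hypothetical and the mutex is reduced to a flag; the only behaviour taken
from the description is that the fast path reads the field without a lock
and takes the lock only when it has to reset the field to 0.

```c
#include <stdatomic.h>

struct demo_lle {
	int req_locked;		/* stand-in for the request mutex */
	atomic_int skip_req;	/* non-zero: timer is waiting for feedback */
};

static void demo_req_lock(struct demo_lle *lle)   { lle->req_locked = 1; }
static void demo_req_unlock(struct demo_lle *lle) { lle->req_locked = 0; }

void
demo_lle_init(struct demo_lle *lle)
{
	lle->req_locked = 0;
	atomic_init(&lle->skip_req, 1);
}

/* Fast path: lockless read; lock only when feedback must be recorded. */
void
demo_fast_path_use(struct demo_lle *lle)
{
	if (atomic_load_explicit(&lle->skip_req, memory_order_relaxed) == 0)
		return;		/* feedback already given, no lock taken */
	demo_req_lock(lle);
	atomic_store_explicit(&lle->skip_req, 0, memory_order_relaxed);
	demo_req_unlock(lle);
}

/*
 * Timer side: report whether the entry was used since the last check,
 * then re-arm the flag so the next use is noticed again.
 */
int
demo_timer_check(struct demo_lle *lle)
{
	int used;

	demo_req_lock(lle);
	used = (atomic_load_explicit(&lle->skip_req,
	    memory_order_relaxed) == 0);
	atomic_store_explicit(&lle->skip_req, 1, memory_order_relaxed);
	demo_req_unlock(lle);
	return (used);
}
```

In the common case of back-to-back packets only the first one after each
timer run pays for the lock; the rest see 0 and return immediately.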

Differential Revision: https://reviews.freebsd.org/D3780


# e8b0643e 29-Nov-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add new rt_foreach_fib_walk_del() function for deleting route entries
by filter function instead of picking into routing table details in
each consumer.
Remove now-unused rt_expunge() (eliminating last external RTF_RNH_LOCKED
user).
This simplifies future nexthops/multipath changes and rtrequest1_fib()
locking refactoring.

Actual changes:
Add "rt_chain" field to permit rte grouping while doing batched delete
from routing table (thus growing rte 200->208 on amd64).
Add "rti_filter" / "rti_filterdata" / "rti_spare" fields to rt_addrinfo
to pass filter function to various routing subsystems in standard way.
Convert all rt_expunge() customers to new rt_addrinfo-based api and eliminate
rt_expunge().


# 637670e7 15-Nov-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Bring back the ability of passing cached route via nd6_output_ifp().


# 7c4676dd 13-Nov-2015 Randall Stewart <rrs@FreeBSD.org>

This fixes several places where callout_stop()'s return value is examined. The
new return codes of -1 were mistakenly being considered "true". Callout_stop
now returns -1 to indicate the callout had either already completed or
was not running and 0 to indicate it could not be stopped. Also update
the manual page to make it more consistent no non-zero in the callout_stop
or callout_reset descriptions.
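
The class of bug being fixed can be shown with a standalone sketch. The
constants and helper names below are hypothetical and do not call the real
callout(9) API; the positive "stopped" value is an assumption, since the
message above only spells out the -1 and 0 cases.

```c
/* Assumed return values for this sketch. */
enum {
	DEMO_STOP_NOT_RUNNING = -1,	/* already completed or never armed */
	DEMO_STOP_FAILED = 0,		/* could not be stopped */
	DEMO_STOP_CANCELLED = 1		/* assumed: pending callout cancelled */
};

/* Buggy check: any non-zero value, including -1, counts as "cancelled". */
int
demo_buggy_was_cancelled(int rc)
{
	return (rc != 0);
}

/* Fixed check: only a positive return means a callout was cancelled. */
int
demo_fixed_was_cancelled(int rc)
{
	return (rc > 0);
}
```

A caller that drops a reference when the callout was cancelled would drop
it twice with the buggy check whenever the callout had already run.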

MFC after: 1 Month with associated callout change.


# ddd208f7 07-Nov-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Unify setting lladdr for AF_INET[6].


# ab415c83 04-Oct-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Invoke lle_event for new entry iff it has lladdr set.


# 7503e0c7 03-Oct-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify if (lladdr) condition in nd6_cache_lladdr():
For case (7) (new entry) nothing has to be done except lle_event.
Invoke this event directly from "create new lle" code block.
For case (4) (existing entry, same mac) useless mac update was performed,
along with LLENTRY_RESOLVED lle_event. There was no sense in doing that,
since nothing really had changed. Simply avoid this condition instead.
Given that, condition was simplified to (3),(5) states which can be merged
with previous block.


# 9b420b3d 04-Oct-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Eliminate nd6_llinfo_settimer(). All consumers were converted to
use nd6_llinfo_settimer_locked() in r216022.
Make nd6_llinfo_settimer_locked() static: last external consumer was
converted in r288124.


# c0b8aeae 04-Oct-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add __noinline attribute to several functions to ease dtrace instrumentation


# 06a60e4b 04-Oct-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix condition for nd6_llinfo_getholdsrc() introduced in r287484.
Effectively it always returned NULL so SAS was always performed and
sometimes the result might have been different.

Fix state machine change accidentally introduced in r287985:
state (4) inside nd6_cache_lladdr() (existing entry got nd message
with the same lladdress) started to cause lle state transition to STALE
instead of no-action.


# 6401c828 02-Oct-2015 Hiroki Sato <hrs@FreeBSD.org>

- Schedule DAD for IN6_IFF_TENTATIVE addresses in nd6_timer(). This
catches cases that DAD probes cannot be sent because of
IFF_UP && !IFF_DRV_RUNNING.

- nd6_dad_starttimer() now calls nd6_dad_ns_output(), instead of
calling it before nd6_dad_starttimer().

- Do not release an entry in dadq when a duplicate entry is being
added.


# 1558cb24 26-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Eliminate nd6_nud_hint() and its TCP bindings.

Initially function was introduced in r53541 (KAME initial commit) to
"provide hints from upper layer protocols that indicate a connection
is making "forward progress"" (quote from RFC 2461 7.3.1 Reachability
Confirmation).
However, it was converted to do nothing (e.g. just return) in r122922
(tcp_hostcache implementation) back in 2003. Some defines were moved
to tcp_var.h in r169541. Then, it was broken (for non-corner cases)
by r186119 (L2<>L3 split) in 2008 (NULL ifp in nd6_lookup). So,
right now this code is broken and has no "real" base users.

Differential Revision: https://reviews.freebsd.org/D3699


# f506d933 22-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Use standard lle LLE_EXCLUSIVE request flags instead of
its redefined version.


# aa5f023e 21-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Unify nd6 state switching by using newly-created nd6_llinfo_setstate()
function. The change is mostly mechanical with the following exception:
Last piece of nd6_resolve_slow() was refactored: ND6_LLINFO_PERMANENT
condition was removed as always-true, explicit ND6_LLINFO_NOSTATE ->
ND6_LLINFO_INCOMPLETE state transition was removed as duplicate.

Reviewed by: ae
Sponsored by: Yandex LLC


# 1496229a 21-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add "stale" timer back to nd6_cache_lladdr().
Setting the timer was accidentally removed in r276844 due to a misleading
comment about it being meaningless. Add it back to restore proper behaviour.


# 501adf01 19-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Cleanup nd6_cache_lladdr(). No functional changes.

* Since new entries are now allocated explicitly, fill in
all the necessary fields for lle _before_ attaching it to the table.
* Remove ND6_LLINFO_INCOMPLETE check which was unused even in
first KAME merge (r53541).
* After that, the only new state that function can set, was
ND6_LLINFO_STALE. Given everything above, simplify logic besides
do_update and is_newentry.
* Fix nd_resolve() comment.


# 41a31e78 18-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Simplify logic besides llchange variable.
* Refresh nd6_is_router() comment.


# 1fe201c3 16-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify the way of attaching IPv6 link-layer header.

Problem description:
How do we currently perform layer 2 resolution and header imposition:

For IPv4 we have the following chain:
ip_output() -> (ether|atm|whatever)_output() -> arpresolve()

Lookup is done in proper place (link-layer output routine) and it is possible
to provide cached lle data.

For IPv6 situation is more complex:
ip6_output() -> nd6_output() -> nd6_output_ifp() -> (whatever)_output() ->
nd6_storelladdr()

We have ip6_output() which calls nd6_output() instead of link output routine.
nd6_output() does the following:
* checks if lle exists, creates it if needed (similar to arpresolve())
* performs lle state transitions (similar to arpresolve())
* calls nd6_output_ifp() which pushes packets to link output routine along
with running SeND/MAC hooks regardless of lle state
(e.g. works as run-hooks placeholder).

After that, iface output routine like ether_output() calls nd6_storelladdr()
which performs lle lookup once again.

As a result, we perform lookup twice for each outgoing packet for most types
of interfaces. We also need to maintain runtime-checked table of 'nd6-free'
interfaces (see nd6_need_cache()).

Fix this behavior by eliminating first ND lookup. To be more specific:
* make all nd6_output() consumers use nd6_output_ifp() instead
* rename nd6_output[_slow]() to nd6_resolve_[slow]()
* convert nd6_resolve() and nd6_resolve_slow() to arpresolve() semantics,
e.g. copy L2 address to buffer instead of pushing packet towards lower
layers
* Make all nd6_storelladdr() users use nd6_resolve()
* eliminate nd6_storelladdr()

The resulting callchain is the following:
ip6_output() -> nd6_output_ifp() -> (whatever)_output() -> nd6_resolve()

Error handling:
Currently sending packet to non-existing la results in ip6_<output|forward>
-> nd6_output() -> nd6_output_lle() which returns 0.
In new scenario packet is propagated to <ether|whatever>_output() ->
nd6_resolve() which will return EWOULDBLOCK, and that result
will be converted to 0.

(And EWOULDBLOCK is actually used by IB/TOE code).
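
The arpresolve()-style contract described above can be sketched in user space
roughly as follows. This is only an illustration under stated assumptions: the
sketch_* names, the struct layout, and the 6-byte address length are
hypothetical and are not the kernel's nd6_resolve() signature or data
structures.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define SK_ADDR_LEN 6	/* assumption: ethernet-sized link-layer address */

/* Hypothetical stand-in for a neighbor cache entry. */
struct sketch_neigh {
	int resolved;				/* non-zero once the L2 address is known */
	unsigned char lladdr[SK_ADDR_LEN];
};

/*
 * Resolver with arpresolve()-like semantics: copy the L2 address into the
 * caller's buffer instead of pushing the packet towards lower layers.
 * Returns EWOULDBLOCK while the address is not yet known.
 */
static int
sketch_resolve(const struct sketch_neigh *n, unsigned char *desten)
{
	assert(desten != NULL);
	if (n == NULL || !n->resolved)
		return (EWOULDBLOCK);
	memcpy(desten, n->lladdr, SK_ADDR_LEN);
	return (0);
}

/*
 * Link-layer output routine: a pending resolution is not reported as an
 * error to the upper layer, so EWOULDBLOCK is converted to 0.
 * *sent tells the caller whether a frame would actually have gone out.
 */
static int
sketch_link_output(const struct sketch_neigh *n, unsigned char *hdr_dst,
    int *sent)
{
	int error;

	*sent = 0;
	error = sketch_resolve(n, hdr_dst);
	if (error == EWOULDBLOCK)
		return (0);
	if (error != 0)
		return (error);
	*sent = 1;	/* header is complete; the frame would be transmitted here */
	return (0);
}
```

In this sketch a single lookup happens per packet, inside the link output
routine, which is the point of the change.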

Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D1469


# f0316e1a 16-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Constantify lookup key in several nd6_* functions.


# 0e2dcee6 15-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify nd6_cache_lladdr:
* Move isRouter calculation code to separate nd6_is_router() function.
* Make nd6_cache_lladdr() return void: its return value hasn't been used
since r53541 KAME import in 1999.

Sponsored by: Yandex LLC


# d3cdb716 15-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Require explicit lle unlink prior to calling llentry_delete().
This one slightly decreases time of holding afdata wlock.
* While here, make nd6_free() return void. No one has used its return value
since r186119.


# 17a03656 14-Sep-2015 Eric van Gyzen <vangyzen@FreeBSD.org>

Fix the handling of IPv6 On-Link Redirects.

On receipt of a redirect message, install an interface route for the
redirected destination. On removal of the corresponding Neighbor Cache
entry, remove the interface route.

This requires changes in rtredirect_fib() to cope with an AF_LINK
address for the gateway and with the absence of RTF_GATEWAY.

This fixes the "Redirected On-Link" test cases in the Tahi IPv6 Ready Logo
Phase 2 test suite.

Unrelated to the above, fix a recursion on the radix node head lock
triggered by the Tahi Redirected to Alternate Router test cases.

When I first wrote this patch in October 2012, all Section 2
(Neighbor Discovery) test cases passed on 10-CURRENT, 9-STABLE,
and 8-STABLE. cem@ recently rebased the 10.x patch onto head and reported
that it passes Tahi. (Thanks!)

These other test cases also passed in 2012:

* the RTF_MODIFIED case, with IPv4 and IPv6 (using a
RTF_HOST|RTF_GATEWAY route for the destination)

* the redirected-to-self case, with IPv4 and IPv6

* a valid IPv4 redirect

All testing in 2012 was done with WITNESS and INVARIANTS.

Tested by: EMC / Isilon Storage Division via Conrad Meyer (cem) in 2015,
Mark Kelley <mark_kelley@dell.com> in 2012,
TC Telkamp <terence_telkamp@dell.com> in 2012
PR: 152791
Reviewed by: melifaro (current rev), bz (earlier rev)
Approved by: kib (mentor)
MFC after: 1 month
Relnotes: yes
Sponsored by: Dell Inc.
Differential Revision: https://reviews.freebsd.org/D3602


# 3e7a2321 14-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Do more fine-grained locking: call eventhandlers/free_entry
without holding afdata wlock
* convert per-af delete_address callback to global lltable_delete_entry() and
more low-level "delete this lle" per-af callback
* fix some bugs/inconsistencies in IPv4/IPv6 ifscrub procedures

Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D3573


# d0bec2c5 10-Sep-2015 Hiroki Sato <hrs@FreeBSD.org>

- Remove SIOCGDRLST_IN6 and SIOCGPRLST_IN6. These are quite old APIs and
there is no consumer now.

- Simplify first and duplicate LLA check.

MFC after: 3 days


# 26deb882 05-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Do not pass lle to nd6_ns_output(). Use newly-added
nd6_llinfo_get_holdsrc() to extract desired IPv6 source
from holdchain and pass it to the nd6_ns_output().


# 3b0fd911 30-Aug-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify lla_rt_output()/nd6_add_ifa_lle() by setting lle state in
alloc handler, based on flags.


# 5a255516 19-Aug-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Split allocation and table linking for lle's.
Before that, the logic besides lle_create() was the following:
return existing if found, create if not. This behaviour was error-prone
since we had to deal with 'sudden' static<>dynamic lle changes.
This commit fixes a bunch of different issues like:
- refcount leak when lle is converted to static.
Simple check case:
console 1:
while true;
do for i in `arp -an|awk '$4~/incomp/{print$2}'|tr -d '()'`;
do arp -s $i 00:22:44:66:88:00 ; arp -d $i;
done;
done
console 2:
ping -f any-dead-host-in-L2
console 3:
# watch for memory consumption:
vmstat -m | awk '$1~/lltable/{print$2}'
- possible problems in arptimer() / nd6_timer() when dropping/reacquiring
lock.
New logic explicitly handles use-or-create cases in every lla_create
user. Basically, most of the changes are purely mechanical. However,
we explicitly avoid using existing lle's for interface/static LLE records.
* While here, call lle_event handlers on all real table lle change.
* Create lltable_free_entry() calling existing per-lltable
lle_free_t callback for entry deletion


# 0447c136 10-Aug-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Use single 'lle_timer' callout in lltable instead of
two different names of the same timer.


# 314294de 11-Aug-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Store addresses instead of sockaddrs inside llentry.
This permits us to have (not fully true yet) all the info
needed in the lookup process in the first 64 bytes of 'struct llentry'.

struct llentry layout:
BEFORE:
[rwlock .. state .. state .. MAC ] (lle+1) [sockaddr_in[6]]
AFTER
[ in[6]_addr MAC .. state .. rwlock ]

Currently, address part of struct llentry has only 16 bytes for the key.
However, lltable does not prevent custom lltable consumers with long
keys from using the previous approach (store key at (lle+1)).

Sponsored by: Yandex LLC


# 30aee131 20-Jul-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Add LLE event handler to report ND6 events to userland via rtsock.

Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC


# 4e870f94 29-May-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Move RTM announces into generic code to be independent from Layer2 code.
This fixes a bug introduced in 274988, where announcements about new addresses
were not sent for tunneling interfaces.

Reported by: tuexen@
MFC after: 1 week


# 0fa5aacd 02-May-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Remove #ifdef IFT_FOO.

Submitted by: Guy Yur <guyyur gmail.com>


# dff78447 12-Apr-2015 Mark Johnston <markj@FreeBSD.org>

Fix a possible refcount leak in regen_tmpaddr().

public_ifa6 may be set to NULL after taking a reference to a previous
address list element. Instead, only take the reference after leaving the
loop but before releasing the address list lock.

Differential Revision: https://reviews.freebsd.org/D2253
Reviewed by: ae
MFC after: 2 weeks


# 11d8451d 02-Mar-2015 Hiroki Sato <hrs@FreeBSD.org>

Implement Enhanced DAD algorithm for IPv6 described in
draft-ietf-6man-enhanced-dad-13.

This basically adds a random nonce option (RFC 3971) to NS messages
for DAD probe to detect a looped back packet. This looped back packet
prevented DAD on some pseudo-interfaces which aggregates multiple L2 links
such as lagg(4).

The length of the nonce is set to 6 bytes. This algorithm can be disabled by
setting net.inet6.ip6.dad_enhanced sysctl to 0 on a per-vnet basis.
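
The loopback check that the nonce enables can be sketched as below. This is a
minimal user-space illustration, not the kernel code: the sketch_* names are
hypothetical, and only the 6-byte nonce length is taken from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SK_NONCE_LEN 6	/* nonce length stated in the commit message */

/*
 * Decide whether a received DAD NS is our own probe looped back.
 * sent_nonce is the nonce we put into our probe; rcvd_nonce is the nonce
 * option of the received NS, or NULL if the NS carried no nonce.
 * A matching nonce means the packet is our own and must not be counted
 * as evidence of a duplicate address.
 */
static bool
sketch_ns_is_looped_back(const unsigned char *sent_nonce,
    const unsigned char *rcvd_nonce)
{
	assert(sent_nonce != NULL);
	if (rcvd_nonce == NULL)
		return (false);	/* no nonce: cannot be identified as ours */
	return (memcmp(sent_nonce, rcvd_nonce, SK_NONCE_LEN) == 0);
}
```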

Reported by: hiren
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D1835


# 2575fbb8 09-Feb-2015 Randall Stewart <rrs@FreeBSD.org>

This fixes a bug in the way that the LLE timers for nd6
and arp were being used. They basically would pass in the
mutex to the callout_init. Because they used this method
to the callout system, it was possible to "stop" the callout.
When flushing the table and you stopped the running callout, the
callout_stop code would return 1 indicating that it was going
to stop the callout (that was about to run on the callout_wheel blocked
by the function calling the stop). Now when 1 was returned, it would
lower the reference count one extra time for the stopped timer, then
a few lines later delete the memory. Of course the callout_wheel was
stuck in the lock code and would then crash since it was accessing
freed memory. By using callout_init(c, 1) we always get a 0 back
and the reference counting bug does not rear its head. We do have
to make a few adjustments to the callouts themselves though to make
sure it does the proper thing if rescheduled as well as gets the lock.

Commented upon by hiren and sbruno
See Phabricator D1777 for more details.

Reviewed by: adrian, jhb and bz
Sponsored by: Netflix Inc.


# d63e657c 08-Jan-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Deal with ARCNET L2 multicast mapping for IPv6 the same way as in IPv4:
handle it in arc_output() instead of nd6_storelladdr().
* Remove IFT_ARCNET check from arpresolve() since arc_output() does not
use arpresolve() to handle broadcast/multicast. This check was there
since r84931. It looks like it was not used since r89099 (initial
import of Arcnet support where multicast is handled separately).
* Remove IFT_IEEE1394 case from nd6_storelladdr() since firewire_output()
calls nd6_storelladdr() for unicast addresses only.
* Remove IFT_ARCNET case from nd6_storelladdr() since arc_output() now
handles multicast by itself.

As a result, we have the following pattern: all non-ethernet-style
media have their own multicast map handling inside their appropriate
routines. On the other hand, arpresolve() (and nd6_storelladdr()), which
were meant to be 'generic' ones, de facto handle ethernet-only multicast maps.

MFC after: 3 weeks


# abc1be90 08-Jan-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add forgotten definition for nd6_output_ifp().


# d7968c29 08-Jan-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Use newly-created nd6_grab_holdchain() function to retrieve lle
hold mbuf chain instead of calling full-blown nd6_output_lle()
for each packet. This simplifies both callers and nd6_output_lle()
implementation.
* Make nd6_output_lle() static and remove now-unused lle and chain
arguments.
* Rename nd6_output_flush() -> nd6_flush_holdchain() to be consistent.
* Move all pre-send transmit hooks to newly-created nd6_output_ifp().
Now nd6_output(), nd6_output_lle() and nd6_flush_holdchain() are using
it to send mbufs to if_output.
* Remove SeND hook from nd6_na_input() because it was implemented
incorrectly since the beginning (r211501):
- it tagged initial input mbuf (m) instead of m_hold
- tagging _all_ mbufs in holdchain seems to be wrong anyway.


# df485dbe 05-Jan-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Do not call LLE_WUNLOCK() for deleted lle.


# 20dd8995 03-Jan-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Hide lltable implementation details in if_llatbl_var.h
* Make most of lltable_* methods 'normal' functions instead of inline
* Add lltable_get_<af|ifp>() functions to access given lltable fields
* Temporarily resurrect nd6_lookup() function


# ee7e9a4e 08-Dec-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Do not assume lle has sockaddr key after struct lle:
use llt_fill_sa_entry() llt method to store lle address in sa.
* Eliminate L3_ADDR macro and either reference IPv4/IPv6 address
directly from lle or use newly-created llt_fill_sa_entry().
* Do not store sockaddr inside arp/ndp lle anymore.


# d82ed505 08-Dec-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify lle lookup/create api by using addresses instead of sockaddrs.


# d6ad6a86 07-Dec-2014 Mark Johnston <markj@FreeBSD.org>

Add refcounting to IPv6 DAD objects and simplify the DAD code to fix a
number of races which could cause double frees or use-after-frees when
performing DAD on an address. In particular, an IPv6 address can now only be
marked as a duplicate from the DAD callout.

Differential Revision: https://reviews.freebsd.org/D1258
Reviewed by: ae, hrs
Reported by: rstone
MFC after: 1 month


# 73b52ad8 07-Dec-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Use llt_prepare_static_entry method to prepare valid per-af static entry.


# 0368226e 07-Dec-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Retire abstract llentry_free() in favor of lltable_drop_entry_queue()
and explicit calls to RTENTRY_FREE_LOCKED()
* Use lltable_prefix_free() in arp_ifscrub to be consistent with nd6.
* Rename <lltable_|llt>_delete function to _delete_addr() to note that
this function is used by external callers. Make this function maintain
its own locking.
* Use lookup/unlink/clear call chain from internal callers instead of
delete_addr.
* Fix LLE_DELETED flag handling


# 721cd2e0 07-Dec-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Do not enforce particular lle storage scheme:
* move lltable allocation to per-domain callbacks.
* make llentry_link/unlink functions overridable llt methods.
* make hash table traversal another overridable llt method.


# a743ccd4 07-Dec-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Add llt_clear_entry() callback which is able to do all lle
cleanup including unlinking/freeing
* Relax locking in lltable_prefix_free_af/lltable_free
* Do not pass @llt to lle free callback: it is always NULL now.
* Unify arptimer/nd6_llinfo_timer: explicitly unlock lle avoiding
unlock/lock sequences
* Do not pass unlocked lle to nd6_ns_output(): add nd6_llinfo_get_holdsrc()
to retrieve preferred source address from lle hold queue and pass it
instead of lle.
* Finally, make nd6_create() create and return unlocked lle
* Separate defrtr handling code from nd6_free():
use nd6_check_del_defrtr() to check if we need to keep entry instead of
performing GC,
use nd6_check_recalc_defrtr() to perform actual recalc on lle removal.
* Move isRouter handling from nd6_cache_lladdr() to separate
nd6_check_router()
* Add initial code to maintain lle runtime flags in sync.


# ce313fdd 30-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Unify lle table dump/prefix removal code.
* Rename lla_XXX -> lltable_XXX_lle to reduce number of name prefixes
used by lltable code.


# 74860d4f 27-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Do not return unlocked/unreferenced lle in arpresolve/nd6_storelladdr -
return lle flags IFF needed.
Do not pass rte to arpresolve - pass is_gateway flag instead.


# af6209a1 24-Nov-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Skip L2 addresses lookups for p2p interfaces.

Discussed with: melifaro
Sponsored by: Yandex LLC


# 73d77028 23-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Do more fine-grained lltable locking: use table runtime lock as rare
as we can.


# 27688dfe 22-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Temporarily revert r274774.


# 9883e41b 20-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Switch IF_AFDATA lock to rmlock


# df629abf 16-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Rework LLE code locking:
* struct llentry is now basically split into 2 pieces:
all fields within 64 bytes (amd64) are now protected by both
ifdata lock AND lle lock, e.g. you require both locks to be held
exclusively for modification. All data necessary for fast path
operations is kept here. Some fields were added:
- r_l3addr - makes lookup key live within first 64 bytes.
- r_flags - flags, containing pre-compiled decision whether given
lle contains usable data or not. Currently the only flag is RLLE_VALID.
- r_len - prepend data len, currently unused
- r_kick - used to provide feedback to control plane (see below).
All other fields are protected by lle lock.
* Add simple state machine for ARP to handle "about to expire" case:
Current model (for the fast path) is the following:
- rlock afdata
- find / rlock rte
- runlock afdata
- see if "expire time" is approaching
(time_uptime + la->la_preempt > la->la_expire)
- if true, call arprequest() and decrease la_preempt
- store MAC and runlock rte
New model (data plane):
- rlock afdata
- find rte
- check if it can be used using r_* fields only
- if true, store MAC
- if r_kick field != 0 set it to 0.
- runlock afdata
New model (control plane):
- schedule arptimer to be called in (V_arpt_keep - V_arp_maxtries)
seconds instead of V_arpt_keep.
- on first timer invocation change state from ARP_LLINFO_REACHABLE
to ARP_LLINFO_VERIFY, sets r_kick to 1 and schedules next call in
V_arpt_rexmit (default to 1 sec).
- on subsequent timer invocations in ARP_LLINFO_VERIFY state, checks
for r_kick value: reschedule if not changed, and send arprequest()
if set to zero (e.g. entry was used).
* Convert IPv4 path to use new single-lock approach. IPv6 bits to follow.
* Slow down in_arpinput(): now valid reply will (in most cases) require
acquiring afdata WLOCK twice. This is requirement for storing changed
lle data. This change will be slightly optimized in future.
* Provide explicit hash link/unlink functions for both ipv4/ipv6 code.
This will probably be moved to generic lle code once we have per-AF
hashing callback inside lltable.
* Perform lle unlink on deletion immediately instead of delaying it to
the timer routine.
* Make r244183 more explicit: use new LLE_CALLOUTREF flag to indicate the
presence of lle reference used for safe callout calls.
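
The r_kick handshake between the data plane and the control plane can be
sketched in user space as follows. This is an illustration under assumptions,
not the kernel implementation: the sketch_* names and the struct are
hypothetical, locking is omitted, and re-arming r_kick after a request is sent
is an assumption the text above does not spell out.

```c
#include <assert.h>

/* Hypothetical, simplified states mirroring REACHABLE and VERIFY. */
enum sketch_state { SK_REACHABLE, SK_VERIFY };

/* Hypothetical stand-in for the fields involved; not struct llentry. */
struct sketch_lle {
	enum sketch_state state;
	int r_kick;		/* set by the timer, cleared by the data plane */
	int requests_sent;	/* counts simulated arprequest() calls */
};

/* Data plane: using the entry clears r_kick, signalling "still in use". */
static void
sketch_use_entry(struct sketch_lle *lle)
{
	if (lle->r_kick != 0)
		lle->r_kick = 0;
}

/*
 * Control plane: one timer invocation. Returns the delay, in seconds,
 * until the next invocation (rexmit stands in for V_arpt_rexmit).
 */
static int
sketch_timer(struct sketch_lle *lle, int rexmit)
{
	assert(rexmit > 0);
	if (lle->state == SK_REACHABLE) {
		/* First invocation: start verifying and arm the kick flag. */
		lle->state = SK_VERIFY;
		lle->r_kick = 1;
		return (rexmit);
	}
	/* SK_VERIFY: r_kick == 0 means the data plane used the entry. */
	if (lle->r_kick == 0) {
		lle->requests_sent++;	/* would call arprequest() here */
		lle->r_kick = 1;	/* assumption: re-arm for the next period */
	}
	return (rexmit);
}
```

An entry that nobody uses never triggers a request in this sketch, while an
entry touched by the fast path gets refreshed before it expires.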


# b4b1367a 15-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Move lle creation/deletion from lla_lookup to separate functions:
lla_lookup(LLE_CREATE) -> lla_create
lla_lookup(LLE_DELETE) -> lla_delete
Assume lla_create to return LLE_EXCLUSIVE lock for lle.
* Rework lla_rt_output to perform all lle changes under afdata WLOCK.
* change arp_ifscrub() to acquire afdata WLOCK, the same as arp_ifinit().


# e6abaf91 10-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Consistently use if_link.

Reviewed by: ae, melifaro


# d0f9fca4 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove forgotten arguments.


# 033074c4 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Replace 'struct route *' if_output() argument with 'struct nhop_info *'.
Leave 'struct route' as is for legacy routing api users.
Remove most of rtalloc_ign*-derived functions.


# 9c9bde01 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove unused 'struct route *' argument from nd6_output_flush().


# 6df8a710 07-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.

Sponsored by: Nginx, Inc.


# 1a75e3b2 06-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Make checks for rt_mtu generic:

Some virtual if drivers have (ab)used the ifa_rtrequest hook to enforce
route MTU to be not bigger than interface MTU. While ifa_rtrequest hooking
might be an option in some situation, it is not feasible to do MTU checks
there: generic (or per-domain) routing code is perfectly capable of doing
this.

We currently have 3 places where MTU is altered:

1) route addition.
In this case domain overrides radix _addroute callback (in[6]_addroute)
and all necessary checks/fixes are/can be done there.

2) route change (especially, GW change).
In this case, there are no explicit per-domain calls, but one can
override rte by setting ifa_rtrequest hook to domain handler
(inet6 does this).

3) ifconfig ifaceX mtu YYYY
In this case, we have no callbacks, but ip[6]_output performs runtime
checks and decreases rt_mtu if necessary.

Generally, the goals are to be able to handle all MTU changes in
control plane, not in runtime part, and properly deal with increased
interface MTU.

This commit changes the following:
* removes hooks setting MTU from drivers side
* adds proper per-domain MTU checks for case 1)
* adds generic MTU check for case 2)

* The latter is done by using new dom_ifmtu callback since
if_mtu denotes L3 interface MTU, e.g. maximum transmitted _packet_ size.
However, IPv6 mtu might be different from if_mtu one (e.g. default 1280)
for some cases, so we need an abstract way to know maximum MTU size
for given interface and domain.
* moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies
user-supplied data which must be checked.
* removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to
use this functions on new non-inserted rte.

More changes will follow soon.

MFC after: 1 month
Sponsored by: Yandex LLC


# 8c3cfe0b 04-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Hide 'struct rtentry' and all its macro inside new header:
net/route_internal.h
The goal is to make it opaque for all code except route/rtsock and
proto domain _rmx.


# 4f8585e0 11-Sep-2014 Alan Somers <asomers@FreeBSD.org>

Revisions 264905 and 266860 added a "int fib" argument to ifa_ifwithnet and
ifa_ifwithdstaddr. For the sake of backwards compatibility, the new
arguments were added to new functions named ifa_ifwithnet_fib and
ifa_ifwithdstaddr_fib, while the old functions became wrappers around the
new ones that passed RT_ALL_FIBS for the fib argument. However, the
backwards compatibility is not desired for FreeBSD 11, because there are
numerous other incompatible changes to the ifnet(9) API. We therefore
decided to remove it from head but leave it in place for stable/9 and
stable/10. In addition, this commit adds the fib argument to
ifa_ifwithbroadaddr for consistency's sake.

sys/sys/param.h
Increment __FreeBSD_version

sys/net/if.c
sys/net/if_var.h
sys/net/route.c
Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib
versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute.

sys/net/route.c
sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_options.c
sys/netinet/ip_output.c
sys/netinet6/nd6.c
Fixup calls of modified functions.

share/man/man9/ifnet.9
Document changed API.

CR: https://reviews.freebsd.org/D458
MFC after: Never
Sponsored by: Spectra Logic


# ff899182 11-Jul-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Fix condition.

Sponsored by: Yandex LLC


# f4839cbc 23-Jun-2014 Hajimu UMEMOTO <ume@FreeBSD.org>

Make nd6_gctimer tunable.

MFC after: 1 week


# 2f308a34 29-May-2014 Alan Somers <asomers@FreeBSD.org>

Fix unintended KBI change from r264905. Add _fib versions of
ifa_ifwithnet() and ifa_ifwithdstaddr() The legacy functions will call the
_fib() versions with RT_ALL_FIBS, preserving legacy behavior.

sys/net/if_var.h
sys/net/if.c
Add legacy-compatible functions as described above. Ensure legacy
behavior when RT_ALL_FIBS is passed as fibnum.

sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/net/route.c
sys/net/rtsock.c
sys/netinet6/nd6.c
Call with _fib() functions if we must use a specific fib, or the
legacy functions otherwise.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
Improve the udp_dontroute test. The bug that this test exercises is
that ifa_ifwithnet() will return the wrong address, if multiple
interfaces have addresses on the same subnet but with different
fibs. The previous version of the test only considered one possible
failure mode: that ifa_ifwithnet_fib() might fail to find any
suitable address at all. The new version also checks whether
ifa_ifwithnet_fib() finds the correct address by checking where the
ARP request goes.

Reported by: bz, hrs
Reviewed by: hrs
MFC after: 1 week
X-MFC-with: 264905
Sponsored by: Spectra Logic


# 0cfee0c2 24-Apr-2014 Alan Somers <asomers@FreeBSD.org>

Fix subnet and default routes on different FIBs on the same subnet.

These two bugs are closely related. The root cause is that ifa_ifwithnet
does not consider FIBs when searching for an interface address.

sys/net/if_var.h
sys/net/if.c
Add a fib argument to ifa_ifwithnet and ifa_ifwithdstaddr. Those
functions will only return an address whose interface fib equals the
argument.

sys/net/route.c
Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib
arguments.

sys/netinet/in.c
Update in_addprefix to consider the interface fib when adding
prefixes. This will prevent it from not adding a subnet route when
one already exists on a different fib.

sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/netinet6/nd6.c
Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet.
In some cases there wasn't a clear specific fib number to use.
In others, I was unable to test those functions so I chose
RT_DEFAULT_FIB to minimize divergence from current behavior. I will
fix some of the latter changes along with PR kern/187553.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
tests/sys/netinet/Makefile
Revert r263738. The udp_dontroute test was right all along.
However, bugs kern/187550 and kern/187553 cancelled each other out
when it came to this test. Because of kern/187553, ifa_ifwithnet
searched the default fib instead of the requested one, but because
of kern/187550, there was an applicable subnet route on the default
fib. The new test added in r263738 doesn't work right, however. I
can verify with dtrace that ifa_ifwithnet returned the wrong address
before I applied this commit, but route(8) miraculously found the
correct interface to use anyway. I don't know how.

Clear expected failure messages for kern/187550 and kern/187552.

PR: kern/187550
PR: kern/187552
Reviewed by: melifaro
MFC after: 3 weeks
Sponsored by: Spectra Logic


# f6990c4e 13-Feb-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Further simplify nd6_output_lle.
Currently we have 3 usage patterns:
1) nd6_output (most traffic flow, no lle supplied, lle RLOCK sufficient)
2) corner cases for output (no lle, STALE lle, so on). lle WLOCK needed.
3) nd* internal machinery (WLOCK'ed lle provided, perform packet queueing).

We separate case 1 and implement it inside its only customer - nd6_output.
This leads to some code duplication (especially SEND stuff, which should be
hooked to output in a different way), but simplifies locking and control
flow logic for nd6_output_lle.

Reviewed by: ae
MFC after: 3 weeks
Sponsored by: Yandex LLC


# 9dffa6a3 09-Feb-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify nd6_output_lle:
* Check ND6_IFF_IFDISABLED before acquiring any locks
* Assume m is always non-NULL
* remove 'bad' case not used anymore
* Simplify if_output conditional

MFC after: 2 weeks
Sponsored by: Yandex LLC


# 74a976ff 07-Feb-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Unlock entry before retry.

Submitted by: melifaro
MFC after: 1 week


# 51eecdc3 02-Feb-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Take exclusive lock only when lle isn't NULL. We don't need write access
to lle in most cases.

MFC after: 1 week
Sponsored by: Yandex LLC


# f6b84910 19-Jan-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Further rework netinet6 address handling code:
* Set ia address/mask values BEFORE attaching to address lists.
Inet6 address assignment is not atomic, so the simplest way to
do this atomically is to fill in ia before attach.
* Validate irfa->ia_addr field before use (we permit ANY sockaddr in old code).
* Do some renamings:
in6_ifinit -> in6_notify_ifa (interaction with other subsystems is here)
in6_setup_ifa -> in6_broadcast_ifa (LLE/Multicast/DaD code)
in6_ifaddloop -> nd6_add_ifa_lle
in6_ifremloop -> nd6_rem_ifa_lle
* Split working with LLE and route announce code for last two.
Add temporary in6_newaddrmsg() function to mimic current rtsock behaviour.
* Call device SIOCSIFADDR handler IFF we're adding first address.
In IPv4 we have to call it on every address change since ARP record
is installed by arp_ifinit() which is called by given handler.
The IPv6 stack, on the contrary, is responsible for calling nd6_add_ifa_lle(),
so there is no reason to call SIOCSIFADDR often.


# ea0c3776 02-Jan-2014 Andrey V. Elsukov <ae@FreeBSD.org>

lla_lookup() does modification only when LLE_CREATE is specified.
Thus we can use IF_AFDATA_RLOCK() instead of IF_AFDATA_LOCK() when doing
lla_lookup() without LLE_CREATE flag.

Reviewed by: glebius, adrian
MFC after: 1 week
Sponsored by: Yandex LLC


# c445d252 31-Dec-2013 Adrian Chadd <adrian@FreeBSD.org>

Use an RLOCK here instead of an RWLOCK - matching all the other calls
to lla_lookup().

This drastically reduces the very high lock contention when doing parallel
TCP throughput tests (> 1024 sockets) with IPv6.

Tested:

* parallel IPv6 TCP bulk data exchange, 8192 sockets

MFC after: 1 week
Sponsored by: Netflix, Inc.


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invocation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# 76039bc8 26-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
for this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 57f60867 25-Aug-2013 Mark Johnston <markj@FreeBSD.org>

Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by: gnn, hiren
MFC after: 1 month


# 86bd0491 19-Aug-2013 Andre Oppermann <andre@FreeBSD.org>

Add m_clrprotoflags() to clear protocol specific mbuf flags at up and
downwards layer crossings.

Consistently use it within IP, IPv6 and ethernet protocols.

Discussed with: trociny, glebius


# 5a041915 17-Aug-2013 Hiroki Sato <hrs@FreeBSD.org>

Return 0 in nbi->expire when la_expire == 0. Conversion from time_uptime to
time_second should not be performed in this case.
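
The conversion rule above can be sketched as follows. This is a minimal
illustration, not the kernel code: the helper name expire_to_wallclock()
and its parameters are hypothetical, and it assumes that an expiry of 0 is
a sentinel that must be passed through unchanged.

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical helper: translate an uptime-based expiry into wall-clock
 * seconds.  An expiry of 0 is returned as 0 and is never shifted by the
 * uptime/wall-clock offset.
 */
static time_t
expire_to_wallclock(time_t la_expire, time_t now_uptime, time_t now_second)
{
	if (la_expire == 0)
		return (0);
	return (la_expire + (now_second - now_uptime));
}
```

Without the early return, a zero expiry would be reported to userland as
a bogus wall-clock timestamp instead of 0.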


# ffa0165a 06-Aug-2013 Hiroki Sato <hrs@FreeBSD.org>

Fix incompatibility in ICMPV6CTL_ND6_PRLIST sysctl, and SIOCGPRLST_IN6,
SIOCGDRLST_IN6, and SIOCGNBRINFO_IN6 ioctl. These userland interfaces
treat expiration times in time_second, not time_uptime.


# 7d26db17 05-Aug-2013 Hiroki Sato <hrs@FreeBSD.org>

- Use time_uptime instead of time_second in data structures for
PF_INET6 in kernel. This fixes various malfunctions when the wall time
clock is changed. Bump __FreeBSD_version to 1000041.

- Use clock_gettime(CLOCK_MONOTONIC_FAST) in userland utilities.

MFC after: 1 month


# 0de0dd9b 31-Jul-2013 Hiroki Sato <hrs@FreeBSD.org>

Allocate in6_ifextra (ifp->if_afdata[AF_INET6]) only for IPv6-capable
interfaces. This eliminates unnecessary IPv6 processing for non-IPv6
interfaces.

MFC after: 3 days


# af805644 02-Jul-2013 Hiroki Sato <hrs@FreeBSD.org>

- Allow ND6_IFF_AUTO_LINKLOCAL for IFT_BRIDGE. An interface with IFT_BRIDGE
is initialized with !ND6_IFF_AUTO_LINKLOCAL && !ND6_IFF_ACCEPT_RTADV
regardless of net.inet6.ip6.accept_rtadv and net.inet6.ip6.auto_linklocal.
To configure an autoconfigured link-local address (RFC 4862), the
following rc.conf(5) configuration can be used:

ifconfig_bridge0_ipv6="inet6 auto_linklocal"

- if_bridge(4) now removes IPv6 addresses on a member interface to be
added when the parent interface or one of the existing member
interfaces has an IPv6 address. if_bridge(4) merges each link-local
scope zone which the member interfaces form respectively, so it causes
address scope violation. Removal of the IPv6 addresses prevents it.

- if_lagg(4) now removes IPv6 addresses on member interfaces
unconditionally.

- Set reasonable flags to non-IPv6-capable interfaces. [*]

Submitted by: rpaulo [*]
MFC after: 1 week


# 47e8d432 25-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Add const qualifier to the dst parameter of the ifnet if_output method.


# b3dcd51d 21-Mar-2013 Kevin Lo <kevlo@FreeBSD.org>

Clean up some unused leftover code.

Pointed out by: ae


# 63a97a40 25-Jan-2013 Navdeep Parhar <np@FreeBSD.org>

Generate lle_event in the IPv6 neighbor discovery code too.

Reviewed by: bz@


# f31b83e1 25-Jan-2013 Navdeep Parhar <np@FreeBSD.org>

Avoid NULL dereference in nd6_storelladdr when no mbuf is provided. It
is called this way from a couple of places in the OFED code. (toecore
calls it too but that's going to change shortly).

Reviewed by: bz@


# b1ec2940 13-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Fix problem in r238990. The LLE_LINKED flag should be tested prior to
entering llentry_free(), and in case if we lose the race, we should simply
perform LLE_FREE_LOCKED(). Otherwise, if the race is lost by the thread
performing arptimer(), it will remove two references from the lle instead
of one.

Reported by: Ian FREISLICH <ianf clue.co.za>


# 73cb2f38 15-Nov-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Reduce the overhead of locking, use IF_AFDATA_RLOCK() when we are doing
simple lookups.

Sponsored by: Yandex LLC
MFC after: 1 week


# 6f56329a 22-Oct-2012 Xin LI <delphij@FreeBSD.org>

Remove __P.

Submitted by: kevlo
Reviewed by: md5(1)
MFC after: 2 months


# c9b652e3 18-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Mechanically remove the last stray remains of spl* calls from net*/*.
They have been Noop's for a long time now.


# 39e19560 25-May-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4 bz_ipv6_fast:

Use M_ZERO with malloc rather than calling bzero() ourselves.

Change if () panic() checks to KASSERT()s as they are only
catching invariants in code flow but not dependent on network
input/output.

Move initial assignments indirecting pointers after the lock
has been acquired.

When passing layer boundaries, reset M_PROTOFLAGS.

Remove a NULL assignment before free.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days


# 7195094c 19-May-2012 Marius Strobl <marius@FreeBSD.org>

Rewrite nd6_sysctl_{d,p}rlist() to avoid misaligned accesses to char arrays
cast to structs by getting rid of these buffers entirely. In r169832, an
attempt was made to paper over this issue by 32-bit aligning the buffers.
Depending on compiler optimizations, that was still insufficient for 64-bit
architectures with strong alignment requirements though.
While at it, add comments regarding the total lack of locking in this area.

Tested by: bz
Reviewed by: bz (slightly earlier version), yongari (earlier version)
MFC after: 1 week


# abbe8356 04-Mar-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

In nd6_options() ignore the RFC 6106 options completely rather than printing
them if nd6_debug is enabled as unknown. Leave a comment about the RFC4191
option as I am undecided so far.

Discussed with: hrs
MFC after: 3 days


# a1875676 02-Mar-2012 Hiroki Sato <hrs@FreeBSD.org>

Remove a redundant check.


# 81d5d46b 03-Feb-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Add multi-FIB IPv6 support to the core network stack supplementing
the original IPv4 implementation from r178888:

- Use RT_DEFAULT_FIB in the IPv4 implementation where noticed.
- Use rt*fib() KPI with explicit RT_DEFAULT_FIB where applicable in
the NFS code.
- Use the new in6_rt* KPI in TCP, gif(4), and the IPv6 network stack
where applicable.
- Split in6_rtqtimo() and in6_mtutimo() as done in IPv4 and equally
prevent multiple initializations of callouts in in6_inithead().
- Use wrapper functions where needed to preserve the current KPI to
ease MFCs. Use BURN_BRIDGES to indicate expected future cleanup.
- Fix (related) comments (both technical or style).
- Convert to rtinit() where applicable and only use custom loops where
currently not possible otherwise.
- Multicast group, most neighbor discovery address actions and faith(4)
are locked to the default FIB. Individual IPv6 addresses will only
appear in the default FIB, however redirect information and prefixes
of connected subnets are automatically propagated to all FIBs by
default (mimicking IPv4 behavior as closely as possible).

Sponsored by: Cisco Systems, Inc.


# 8e4609a4 25-Jan-2012 Sergey Kandaurov <pluknet@FreeBSD.org>

Remove unused variable.

The actual ia6->ia6_lifetime access is hidden in
IFA6_IS_INVALID/IFA6_IS_DEPRECATED macros since a long time ago
(see netinet6/nd6.c, r1.104 of KAME for the reference).

MFC after: 3 days


# 137f91e8 05-Jan-2012 John Baldwin <jhb@FreeBSD.org>

Convert all users of IF_ADDR_LOCK to use new locking macros that specify
either a read lock or write lock.

Reviewed by: bz
MFC after: 2 weeks


# 3b0b2840 29-Dec-2011 John Baldwin <jhb@FreeBSD.org>

Use queue(3) macros instead of home-rolled versions in several places in
the INET6 code. This includes retiring the 'ndpr_next' and 'pfr_next'
macros.

Submitted by: pluknet (earlier version)
Reviewed by: pluknet
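
As an illustration of the queue(3) style, a list walk with LIST_FOREACH
replaces a hand-written "next" pointer loop. The struct prefix type and
count_prefixes() below are hypothetical stand-ins, not the actual nd6
structures.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical prefix record; the real structure has many more fields. */
struct prefix {
	int			plen;
	LIST_ENTRY(prefix)	link;
};

LIST_HEAD(prefix_head, prefix);

/* Walk the list with LIST_FOREACH instead of a home-rolled next macro. */
static int
count_prefixes(struct prefix_head *head)
{
	struct prefix *pr;
	int n = 0;

	LIST_FOREACH(pr, head, link)
		n++;
	return (n);
}
```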


# 08b68b0e 15-Dec-2011 Gleb Smirnoff <glebius@FreeBSD.org>

A major overhaul of the CARP implementation. The ip_carp.c was started
from scratch, copying needed functionality from the old implementation
on demand, with a thorough review of all code. The main change is that
interface layer has been removed from the CARP. Now redundant addresses
are configured exactly on the interfaces they run on.

The CARP configuration itself is, as before, configured and read via
SIOCSVH/SIOCGVH ioctls. A new prefix created with SIOCAIFADDR or
SIOCAIFADDR_IN6 may now be configured to a particular virtual host id,
which makes the prefix redundant.

ifconfig(8) semantics have changed too: one no longer needs
to clone a carpXX interface; instead, a vhid is configured directly
on an Ethernet interface.

To supply vhid data from the kernel to an application the getifaddrs(3)
function has been changed to pass ifam_data with each address. [1]

The new implementation definitely closes all PRs related to carp(4)
being an interface, and may close several others. It also allows
to run a single redundant IP per interface.

Big thanks to Bjoern Zeeb for his help with inet6 part of patch, for
idea on using ifam_data and for several rounds of reviewing!

PR: kern/117000, kern/126945, kern/126714, kern/120130, kern/117448
Reviewed by: bz
Submitted by: bz [1]


# 0f1aca65 11-Nov-2011 Qing Li <qingli@FreeBSD.org>

A default route learned from the RAs could be deleted manually
after its installation. This removal may be accidental and can
prevent the default route from being installed in the future if
the associated default router has the best preference. The cause
is the lack of status update in the default router on the state
of its route installation in the kernel FIB. This patch fixes
the described problem.

Reviewed by: hrs, discussed with hrs
MFC after: 5 days


# 154d5f73 16-Oct-2011 Hiroki Sato <hrs@FreeBSD.org>

Fix a problem where an interface unexpectedly becomes IFF_UP by
just doing "ifconfig inet6 -ifdisabled" when the interface has
the ND6_IFF_AUTO_LINKLOCAL flag and no link-local address.


# d7ae3714 30-Sep-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix an obvious bug from r186196 shadowing a variable, not correctly
appending the new mbuf to the chain reference but possibly causing an mbuf
nextpkt loop leading to a memory used after handoff (or having been freed)
and leaking an mbuf here.

Reviewed by: rwatson, brooks
MFC after: 3 days


# 23be7825 05-Jun-2011 Hiroki Sato <hrs@FreeBSD.org>

Do not activate automatic LL addr configuration on a 0/1->1 transition of
the ND6_IFF_IFDISABLED flag.


# 77bc4985 05-Jun-2011 Hiroki Sato <hrs@FreeBSD.org>

- Make the code more proactively clear an ND6_IFF_IFDISABLED flag when
an explicit action for INET6 configuration happens. The changes are:

1. When an ND6 flag is changed via SIOCSIFINFO_FLAGS ioctl,
setting ND6_IFF_ACCEPT_RTADV and/or ND6_IFF_AUTO_LINKLOCAL now triggers
an attempt to clear the ND6_IFF_IFDISABLED flag.

2. When an AF_INET6 address is added successfully to an interface and
it is marked as ND6_IFF_IFDISABLED, an attempt to clear the
ND6_IFF_IFDISABLED happens.

This simplifies ND6_IFF_IFDISABLED flag manipulation by users via ifconfig(8);
in most cases manual configuration is no longer needed.

- When ND6_IFF_AUTO_LINKLOCAL is set and no link-local address is assigned to
an interface, SIOCSIFINFO_FLAGS ioctl now calls in6_ifattach() to configure
a link-local address.

This change ensures link-local address configuration when "ifconfig IF inet6"
command is invoked. For example, "ifconfig IF inet6 auto_linklocal" now
always tries to configure an LL addr even if ND6_IFF_AUTO_LINKLOCAL is already
set to 1 (i.e. down/up cycle is no longer needed).

Reviewed by: bz


# e7fa8d0a 05-Jun-2011 Hiroki Sato <hrs@FreeBSD.org>

- Accept Router Advertisement messages even when net.inet6.ip6.forwarding=1.

- A new per-interface knob IFF_ND6_NO_RADR and sysctl IPV6CTL_NO_RADR.
This controls whether a route in an RA message is accepted as the default
route. The default value for each interface can be set by
net.inet6.ip6.no_radr. The system wide default value is 0.

- A new sysctl: net.inet6.ip6.norbit_raif. This controls whether the R-bit is
set in NAs on RA-accepting interfaces. The default is 0 (R-bit is set based on
net.inet6.ip6.forwarding).

Background:

IPv6 host/router model suggests a router sends an RA and a host accepts it for
router discovery. Because of that, KAME implementation does not allow
accepting RAs when net.inet6.ip6.forwarding=1. Accepting RAs on a router can
make the routing table confused since it can change the default router
unintentionally.

However, in practice there are cases where we cannot distinguish a host from
a router clearly. For example, a customer edge router often works as a host
against the ISP, and as a router against the LAN at the same time. Another
example is a complex network configuration like an L2TP tunnel for IPv6
connection to Internet over an Ethernet link with another native IPv6 subnet.
In this case, the physical interface for the native IPv6 subnet works as a
host, and the pseudo-interface for L2TP works as the default IP forwarding
route.

Problem:

Disabling processing RA messages when net.inet6.ip6.forwarding=1 and
accepting them when net.inet6.ip6.forwarding=0 causes the following practical
issues:

- A router cannot perform SLAAC. It becomes a problem if a box has
multiple interfaces and you want to use SLAAC on some of them, for
example. A customer edge router for IPv6 Internet access service
using an IPv6-over-IPv6 tunnel sometimes needs SLAAC on the
physical interface for administration purpose; updating firmware
and so on (link-local addresses can be used there, but GUAs by
SLAAC are often used for scalability).

- When a host has multiple IPv6 interfaces and it receives multiple RAs on
them, controlling the default route is difficult. Router preferences
defined in RFC 4191 work only when the routers on the links are
under your control.

Details of Implementation Changes:

Router Advertisement messages will be accepted even when
net.inet6.ip6.forwarding=1. More precisely, the conditions are as
follows:

(ACCEPT_RTADV && !NO_RADR && !ip6.forwarding)
=> Normal RA processing on that interface. (as IPv6 host)

(ACCEPT_RTADV && (NO_RADR || ip6.forwarding))
=> Accept RA but add the router to the defroute list with
rtlifetime=0 unconditionally. This effectively prevents
from setting the received router address as the box's
default route.

(!ACCEPT_RTADV)
=> No RA processing on that interface.

ACCEPT_RTADV and NO_RADR are per-interface knobs. In short, all interfaces
are classified as "RA-accepting" or not. An RA-accepting interface always
processes RA messages regardless of ip6.forwarding. The difference caused by
NO_RADR or ip6.forwarding is whether the RA source address is considered as
the default router or not.
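
The three cases above can be restated as a small decision function. This
is only a sketch of the described conditions; the enum and the function
name classify_ra() are hypothetical and do not appear in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

enum ra_mode {
	RA_IGNORE,		/* no RA processing on the interface */
	RA_ACCEPT_NO_DEFROUTE,	/* accept RA, router added with rtlifetime=0 */
	RA_ACCEPT_NORMAL	/* normal RA processing, as an IPv6 host */
};

/* Classify an interface from its two knobs and the global forwarding flag. */
static enum ra_mode
classify_ra(bool accept_rtadv, bool no_radr, bool forwarding)
{
	if (!accept_rtadv)
		return (RA_IGNORE);
	if (no_radr || forwarding)
		return (RA_ACCEPT_NO_DEFROUTE);
	return (RA_ACCEPT_NORMAL);
}
```

Note that forwarding alone never disables RA processing in this model; it
only decides whether the RA source may become the default router.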

R-bit in NA on the RA accepting interfaces is set based on
net.inet6.ip6.forwarding. While RFC 6204 W-1 rule (for CPE case) suggests
a router should disable the R-bit completely even when the box has
net.inet6.ip6.forwarding=1, I believe there is no technical reason for
doing so. This behavior can be set by a new sysctl net.inet6.ip6.norbit_raif
(the default is 0).

Usage:

# ifconfig fxp0 inet6 accept_rtadv
=> accept RA on fxp0
# ifconfig fxp0 inet6 accept_rtadv no_radr
=> accept RA on fxp0 but ignore default route information in it.
# sysctl net.inet6.ip6.norbit_raif=1
=> R-bit in NAs on RA accepting interfaces will always be set to 0.


# e4cd31dd 21-Mar-2011 Jeff Roberson <jeff@FreeBSD.org>

- Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
and other miscellaneous small features.


# 1d5089c2 07-Dec-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

Loosen the locking in nd6_free() again after r216022 to avoid
a LOR and a recursed lock.

Reported by: delphij
Tested by: delphij
PR: kern/148857
MFC After: 3 days


# e6950476 28-Nov-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

Plug well observed races on la_hold entries with the callout handler.

Call the handler function with the lock held, return unlocked as we
might free the entry. Rework functions later in the call graph to be
either called with the lock held or, only if needed, unlocked.

Place asserts to document and tighten assumptions on various lle locking,
which were not always true before.

We call nd6_ns_output() unlocked and the assignment of ip6->ip6_src was
decentralized to minimize possible complexity introduced with the formerly
missing locking there. This also resulted in a push down of local
variable scopes into smaller blocks.

Reported by: many
PR: kern/148857
Submitted by: Dmitrij Tejblum (tejblum yandex-team.ru) (original version)
MFC After: 4 days


# 3e288e62 22-Nov-2010 Dimitry Andric <dim@FreeBSD.org>

After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.


# 68352503 17-Nov-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

Do not initialize flag variables before needed.
Consistently use the LLE_ prefix for lla_lookup() and the ND6_ prefix
for nd6_lookup() even though both are defined the same. Use the right
flag variable when checking each.

No real functional change.

MFC after: 4 days


# 20723e34 17-Nov-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

No need to re-initialize the callout. We initially do it in in6_lltable_new()
right after allocation. Worse, we are losing the right flags here.

MFC after: 4 days


# 31c6a003 14-Nov-2010 Dimitry Andric <dim@FreeBSD.org>

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 1db8d1f8 19-Aug-2010 Ana Kukec <anchie@FreeBSD.org>

MFp4: anchie_soc2009 branch:

Add kernel side support for Secure Neighbor Discovery (SeND), RFC 3971.

The implementation consists of a kernel module that gets packets from
the nd6 code, sends them to user space on a dedicated socket and reinjects
them back for further processing.

Hooks are used from nd6 code paths to divert relevant packets to the
send implementation for processing in user space. The hooks are only
triggered if the send module is loaded. In case no user space
application is connected to the send socket, processing continues
normally as if the module were not loaded. Unloading the module
is not possible at this time due to missing nd6 locking.

The native SeND socket is similar to a raw IPv6 socket but with its own,
internal pseudo-protocol.

Approved by: bz (mentor)


# 19291ab3 31-Jul-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

Document the mandatory argument to the arptimer() and
nd6_llinfo_timer() functions with a KASSERT().
Note: there is no need to return after panic.

In the legacy IP case, only assign the arg after the check,
in the IPv6 case, remove the extra checks for the table and
interface as they have to be there unless we freed and forgot
to cancel the timer. It doesn't matter anyway as we would
panic on the NULL pointer deref immediately and the bug is
elsewhere.
This unifies the code of both address families to some extent.

Reviewed by: rwatson
MFC after: 6 days


# 480d7c6c 06-May-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r207369:
MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitespace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOBALS (before r185088) and try to
reduce the diff to stable/7 and earlier as much as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH


# 82cea7e6 29-Apr-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitespace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOBALS (before r185088) and try to
reduce the diff to stable/7 and earlier as much as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days


# feb3a5f7 21-Apr-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r206481:

Plug reference leaks in the link-layer code ("new-arp") that previously
prevented the link-layer entry from being freed.

In both in.c and in6.c (though that code path seems to be basically dead)
plug a reference leak in case of a pending callout being drained.

In if_ether.c consistently add a reference before resetting the callout
and in case we canceled a pending one remove the reference for that.
In the final case in arptimer, before freeing the expired entry, remove
the reference again and explicitly call callout_stop() to clear the active
flag.

In nd6.c:nd6_free() we are only ever called from the callout function and
thus need to remove the reference there as well before calling into
llentry_free().

In if_llatbl.c when freeing the entire tables make sure that in case we
cancel a pending callout to remove the reference as well.

Reviewed by: qingli (earlier version)
MFC after: 10 days
Problem observed, patch tested by: simon on ipv6gw.f.o,
Christian Kratzer (ck cksoft.de),
Evgenii Davidov (dado korolev-net.ru)
PR: kern/144564
Configurations still affected: with options FLOWTABLE


# becba438 11-Apr-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

Plug reference leaks in the link-layer code ("new-arp") that previously
prevented the link-layer entry from being freed.

In both in.c and in6.c (though that code path seems to be basically dead)
plug a reference leak in case of a pending callout being drained.

In if_ether.c consistently add a reference before resetting the callout
and in case we canceled a pending one remove the reference for that.
In the final case in arptimer, before freeing the expired entry, remove
the reference again and explicitly call callout_stop() to clear the active
flag.

In nd6.c:nd6_free() we are only ever called from the callout function and
thus need to remove the reference there as well before calling into
llentry_free().

In if_llatbl.c when freeing entire tables make sure that in case we cancel
a pending callout to remove the reference as well.

Reviewed by: qingli (earlier version)
MFC after: 10 days
Problem observed, patch tested by: simon on ipv6gw.f.o,
Christian Kratzer (ck cksoft.de),
Evgenii Davidov (dado korolev-net.ru)
PR: kern/144564
Configurations still affected: with options FLOWTABLE


# 22177b72 02-Apr-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r205637:

We are holding a write lock here so avoid acquiring it twice by calling
the "locked" version rather than the wrapper function.


# d715e397 25-Mar-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

We are holding a write lock here so avoid acquiring it twice by calling
the "locked" version rather than the wrapper function.

MFC after: 6 days


# 44520d29 08-Mar-2010 Qing Li <qingli@FreeBSD.org>

MFC 204402

Use reference counting instead of locking to secure an address while
that address is being used to generate a temporary IPv6 address. This
approach is sufficient and avoids recursive locking.


# c1752bcd 27-Feb-2010 Qing Li <qingli@FreeBSD.org>

Use reference counting instead of locking to secure an address while
that address is being used to generate a temporary IPv6 address. This
approach is sufficient and avoids recursive locking.

MFC after: 3 days


# c5f368ce 05-Jan-2010 Qing Li <qingli@FreeBSD.org>

MFC r201284

Multiple IPv6 addresses of the same prefix can be installed on the
same interface. The first address will install the prefix route into
the kernel routing table and that prefix will be marked as on-link.
Without RADIX_MPATH enabled, the other address aliases of the same
prefix will update the prefix reference count but no other routes
will be installed. Consequently the prefixes associated with these
addresses would not be marked as on-link. As such, incoming packets
destined to these address aliases will fail the ND6 on-link check
on input. This patch fixes the above problem by searching the kernel
routing table and trying to find an on-link prefix on the given interface.


# baf7c373 30-Dec-2009 Qing Li <qingli@FreeBSD.org>

Multiple IPv6 addresses of the same prefix can be installed on the
same interface. The first address will install the prefix route into
the kernel routing table and that prefix will be marked as on-link.
Without RADIX_MPATH enabled, the other address aliases of the same
prefix will update the prefix reference count but no other routes
will be installed. Consequently the prefixes associated with these
addresses would not be marked as on-link. As such, incoming packets
destined to these address aliases will fail the ND6 on-link check
on input. This patch fixes the above problem by searching the kernel
routing table and trying to find an on-link prefix on the given interface.

MFC after: 5 days


# 30178759 19-Nov-2009 Hajimu UMEMOTO <ume@FreeBSD.org>

MFC r199225:
- We are not guaranteed that we're not dropping a reference that
we did not add. Call LLE_REMREF() only when callout_stop()
actually canceled a pending callout.
- callout_reset() may cancel a pending callout. When
callout_reset() canceled a pending callout, call LLE_REMREF()
to drop a reference for the canceled callout.


# 864fb934 13-Nov-2009 Hajimu UMEMOTO <ume@FreeBSD.org>

MFC r199173: CURVNET_RESTORE() was not called in certain cases.


# ef8d671c 12-Nov-2009 Hajimu UMEMOTO <ume@FreeBSD.org>

- We are not guaranteed that we're not dropping a reference that
we did not add. Call LLE_REMREF() only when callout_stop()
actually canceled a pending callout.
- callout_reset() may cancel a pending callout. When
callout_reset() canceled a pending callout, call LLE_REMREF()
to drop a reference for the canceled callout.

MFC after: 1 week
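
The reference accounting described in this change can be modeled in plain
C. This is a toy model under stated assumptions: a pending timer holds
exactly one reference, and timer_stop() stands in for the "did we cancel a
pending callout" result. None of the names below are the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a pending timer holds one reference on the entry. */
struct entry {
	int	refcnt;
	bool	timer_pending;
};

/* Returns true only if a pending timer was actually canceled. */
static bool
timer_stop(struct entry *e)
{
	bool was_pending = e->timer_pending;

	e->timer_pending = false;
	return (was_pending);
}

/* Drop the timer's reference only when a pending timer was canceled. */
static void
entry_cancel_timer(struct entry *e)
{
	if (timer_stop(e))
		e->refcnt--;
}

/* Re-arm: take a reference for the new timer, release the canceled one's. */
static void
entry_rearm_timer(struct entry *e)
{
	bool canceled = timer_stop(e);

	e->refcnt++;
	e->timer_pending = true;
	if (canceled)
		e->refcnt--;
}
```

In this model the count stays balanced no matter how often the timer is
re-armed or canceled, which is the invariant the fix restores.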


# f0c0b143 11-Nov-2009 Hajimu UMEMOTO <ume@FreeBSD.org>

CURVNET_RESTORE() was not called in certain cases.

MFC after: 3 days


# c303f1eb 09-Nov-2009 Hajimu UMEMOTO <ume@FreeBSD.org>

MFC r198976, r198993:
- Don't call LLE_FREE() after nd6_free().
- Make nd6_llinfo_timer() do its job, again. ln->la_expire was
greater than time_second, in most cases.


# 287e3cb4 06-Nov-2009 Hajimu UMEMOTO <ume@FreeBSD.org>

Make nd6_llinfo_timer() do its job, again. ln->la_expire was
greater than time_second, in most cases.

MFC after: 3 days


# 2eb10edc 06-Nov-2009 Hajimu UMEMOTO <ume@FreeBSD.org>

Don't call LLE_FREE() after nd6_free().

MFC after: 3 days


# a283298c 12-Sep-2009 Hiroki Sato <hrs@FreeBSD.org>

Improve flexibility of receiving Router Advertisement and
automatic link-local address configuration:

- Convert a sysctl net.inet6.ip6.accept_rtadv to one for the
default value of a per-IF flag ND6_IFF_ACCEPT_RTADV, not a
global knob. The default value of the sysctl is 0.

- Add a new per-IF flag ND6_IFF_AUTO_LINKLOCAL and convert a
sysctl net.inet6.ip6.auto_linklocal to one for its default
value. The default value of the sysctl is 1.

- Make ND6_IFF_IFDISABLED more robust. It can be used to disable
IPv6 functionality of an interface now.

- Receiving RA is allowed if ip6_forwarding==0 *and*
ND6_IFF_ACCEPT_RTADV is set on that interface. The former
condition will be revisited later to support a "host + router" box
like IPv6 CPE router. The current behavior is compatible with
the older releases of FreeBSD.

- The ifconfig(8) now supports these ND6 flags as well as "nud",
"prefer_source", and "disabled" in ndp(8). The ndp(8) now
supports "auto_linklocal".

Discussed with: bz and jinmei
Reviewed by: bz
MFC after: 3 days


# 3ef94f2b 28-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge r196481 from head to stable/8:

Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock. Either can be held to stabilize the lists and indexes, but both
are required to write. This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions. As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by: bz, julian

Approved by: re (kib)


# 77dfcdc4 23-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock. Either can be held to stabilize the lists and indexes, but both
are required to write. This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions. As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by: bz, julian
MFC after: 3 days


# 530c0060 01-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# 1e77c105 16-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used. Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with: bz, julian
Reviewed by: bz
Approved by: re (kensmith, kib)


# eddfbb76 14-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing them in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
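
The idea of a reference copy plus a per-instance region reached through an
accessor macro can be sketched in userland C. This is a simplified model,
not the kernel implementation: a struct stands in for the linker set, and
struct vnet_vars, vnet_alloc(), vnet_free() and VNET_VAR() are all
hypothetical names. The macro relies on the __typeof__ extension of GCC
and Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical set of virtualized globals; stands in for the linker set. */
struct vnet_vars {
	int	hop_limit;
	int	debug;
};

/* The "reference copy", statically initialized like ordinary globals. */
static const struct vnet_vars vnet_reference = { .hop_limit = 64, .debug = 0 };

/* One network stack instance with its own copy of the variables. */
struct vnet {
	char	*data_mem;
};

static struct vnet *
vnet_alloc(void)
{
	struct vnet *v;

	v = malloc(sizeof(*v));
	if (v == NULL)
		return (NULL);
	v->data_mem = malloc(sizeof(vnet_reference));
	if (v->data_mem == NULL) {
		free(v);
		return (NULL);
	}
	/* Initialize the per-instance region from the reference copy. */
	memcpy(v->data_mem, &vnet_reference, sizeof(vnet_reference));
	return (v);
}

static void
vnet_free(struct vnet *v)
{
	free(v->data_mem);
	free(v);
}

/* Accessor: per-instance base address plus the variable's offset. */
#define	VNET_VAR(v, name)						\
	(*(__typeof__(((struct vnet_vars *)0)->name) *)			\
	    ((v)->data_mem + offsetof(struct vnet_vars, name)))
```

Writing through VNET_VAR() on one instance leaves both the reference copy
and every other instance untouched, which is the isolation property the
text describes.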

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# d1da0a06 25-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Add address list locking for in6_ifaddrhead/ia_link: as with locking
for in_ifaddrhead, we stick with an rwlock for the time being, which
we will revisit in the future with a possible move to rmlocks.

Some pieces of code require significant further reworking to be
safe from all classes of writer-writer races.

Reviewed by: bz
MFC after: 6 weeks


# 80af0152 24-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Convert netinet6 to using queue(9) rather than hand-crafted linked lists
for the global IPv6 address list (in6_ifaddr -> in6_ifaddrhead). Adopt
the code styles and conventions present in netinet where possible.

Reviewed by: gnn, bz
MFC after: 6 weeks (possibly not MFCable?)


# 8c0fec80 23-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references. The following routines now return references:

ifaddr_byindex
ifa_ifwithaddr
ifa_ifwithbroadaddr
ifa_ifwithdstaddr
ifa_ifwithnet
ifaof_ifpforaddr
ifa_ifwithroute
ifa_ifwithroute_fib
rt_getifa
rt_getifa_fib
IFP_TO_IA
ip_rtaddr
in6_ifawithifp
in6ifa_ifpforlinklocal
in6ifa_ifpwithaddr
in6_ifadd
carp_iamatch6
ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc). In a few cases, add missing if_addr_list locking
required to safely acquire references.
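
The "return a reference, caller disposes of it" convention can be sketched
as follows. This is a simplified, single-threaded illustration, not the
kernel code: toy_ifaddr, toy_ifa_ref(), toy_ifa_free() and toy_lookup() are
hypothetical names, the counter is not atomic, and no list lock is taken.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_ifaddr {
	unsigned int refcnt;	/* outstanding references */
	int addr;		/* placeholder for the address payload */
};

static int toy_ifaddr_freed;	/* counts destroyed objects, for demonstration */

/* Allocate an address holding one reference (owned by the list). */
static struct toy_ifaddr *
toy_ifa_alloc(int addr)
{
	struct toy_ifaddr *ifa = malloc(sizeof(*ifa));

	if (ifa != NULL) {
		ifa->refcnt = 1;
		ifa->addr = addr;
	}
	return (ifa);
}

static void
toy_ifa_ref(struct toy_ifaddr *ifa)
{
	ifa->refcnt++;
}

/* Drop one reference; the last one frees the object. */
static void
toy_ifa_free(struct toy_ifaddr *ifa)
{
	if (--ifa->refcnt == 0) {
		toy_ifaddr_freed++;
		free(ifa);
	}
}

/* Lookup returns a reference, not a bare pointer: the caller must release it. */
static struct toy_ifaddr *
toy_lookup(struct toy_ifaddr **list, size_t n, int addr)
{
	for (size_t i = 0; i < n; i++) {
		if (list[i] != NULL && list[i]->addr == addr) {
			toy_ifa_ref(list[i]);
			return (list[i]);
		}
	}
	return (NULL);
}
```

With this convention, removing the address from the list while a caller
still holds its reference does not free the memory under that caller.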

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit. Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by: bz
Obtained from: Apple, Inc. (portions)
MFC after: 6 weeks (portions)


# 8d8bc018 08-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

After r193232 rt_tables in vnet.h are no longer indirectly dependent on
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.

Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.


# bc29160d 08-Jun-2009 Marko Zec <zec@FreeBSD.org>

Introduce an infrastructure for dismantling vnet instances.

Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be incrementally fixed over the next weeks in
smaller incremental commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 0733f6a6 01-Jun-2009 Marko Zec <zec@FreeBSD.org>

Remove an #undef MIN that slipped under the radar and led me to
hastily introduce an #define MIN() a few lines below in r191816.

Approved by: julian (mentor)
Discussed with: bz


# 21ca7b57 05-May-2009 Marko Zec <zec@FreeBSD.org>

Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The curvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, socket's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, with an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typical cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.
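
The set/restore discipline described above can be sketched in userland C.
This is an illustration under assumptions, not the kernel macros:
TOY_CURVNET_SET(), TOY_CURVNET_RESTORE(), toy_timer_handler() and
toy_input() are hypothetical names, and C11 _Thread_local stands in for
the per-thread td_vnet field.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vnet {
	int id;
};

/* Per-thread "current vnet"; NULL outside of networking code. */
static _Thread_local struct toy_vnet *toy_curvnet;

/* Save the previous context in a local, then switch to the new one. */
#define TOY_CURVNET_SET(vnet)					\
	struct toy_vnet *toy_saved_vnet = toy_curvnet;		\
	toy_curvnet = (vnet)

/* Revert to whatever context was active before the matching SET. */
#define TOY_CURVNET_RESTORE()					\
	toy_curvnet = toy_saved_vnet

/* Timer-driven work: no caller context to rely on, so set it explicitly. */
static int
toy_timer_handler(struct toy_vnet *vnet)
{
	TOY_CURVNET_SET(vnet);
	int seen = toy_curvnet->id;
	TOY_CURVNET_RESTORE();
	return (seen);
}

/* Inbound path that recurses into a second context and comes back. */
static int
toy_input(struct toy_vnet *outer, struct toy_vnet *inner)
{
	TOY_CURVNET_SET(outer);
	int nested = toy_timer_handler(inner);	/* recursion on curvnet */
	int after = toy_curvnet->id;		/* outer context is active again */
	TOY_CURVNET_RESTORE();
	return (nested * 10 + after);
}
```

Because each SET keeps the previous value in a local variable, nested
calls unwind correctly, which is what makes recursion possible at all.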

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by: julian (mentor)


# f6dfe47a 30-Apr-2009 Marko Zec <zec@FreeBSD.org>

Permit building kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:

options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

INIT_VNET_NET(ifp->if_vnet); becomes

struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by: bz, rwatson
Approved by: julian (mentor)


# c4dd3fe1 20-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Prefer structure fields (ifa_link) to macro aliases for them
(ifa_list).

MFC after: 2 weeks


# 1e6a4139 20-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Acquire interface address list lock around access to if_addrhead,
closing several writer-writer races, and some read-write races.

MFC after: 2 weeks


# f68ffa03 20-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Use TAILQ_FOREACH() and TAILQ_FOREACH_SAFE() rather than manually
accessing queue(9) structure fields for if_addrhead.

Prefer FreeBSD field name if_addrhead to compatibility macro
if_addrlist.

MFC after: 2 weeks


# 279aa3d4 16-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

Change if_output to take a struct route as its fourth argument in order
to allow passing a cached struct llentry * down to L2

Reviewed by: rwatson


# e27b0c87 12-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Update stats in struct icmpstat and icmp6stat using four new
macros: ICMPSTAT_ADD(), ICMPSTAT_INC(), ICMP6STAT_ADD(), and
ICMP6STAT_INC(), rather than directly manipulating the fields
of these structures across the kernel. This will make it
easier to change the implementation of these statistics,
such as using per-CPU versions of the data structures.

In one case, icmp6stat members are manipulated indirectly, by
icmp6_errcount(), and this will require further work to fix
for per-CPU stats.
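
A minimal sketch of the macro-wrapper idea, assuming a hypothetical
toy_icmp6stat structure with made-up field names (badlen, bytes); the
actual ICMP6STAT_ADD() and ICMP6STAT_INC() definitions live in the kernel
headers and may differ. The point is that call sites name only the
counter, so the storage behind the macro can change later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical statistics block; field names are illustrative only. */
struct toy_icmp6stat {
	uint64_t badlen;	/* packets shorter than the minimum header */
	uint64_t bytes;		/* bytes accepted */
};

static struct toy_icmp6stat toy_icmp6stat;

/* All updates go through these two macros, never through the fields. */
#define TOY_ICMP6STAT_ADD(name, val)	(toy_icmp6stat.name += (val))
#define TOY_ICMP6STAT_INC(name)		TOY_ICMP6STAT_ADD(name, 1)

/* Example call site: account for one received packet of the given length. */
static void
toy_icmp6_input(size_t len)
{
	if (len < 8) {
		TOY_ICMP6STAT_INC(badlen);
		return;
	}
	TOY_ICMP6STAT_ADD(bytes, len);
}
```

Switching to per-CPU counters would then mean redefining the two macros
rather than touching every call site.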

MFC after: 3 days


# 33553d6e 27-Feb-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.


# 61cab5d6 27-Feb-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Shuffle the vimage.h includes or add where missing.


# 959e14c1 31-Jan-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Remove unused local MACROs.

Submitted by: Christoph Mallon christoph.mallon@gmx.de
MFC after: 2 weeks


# bbd8aeba 17-Dec-2008 Qing Li <qingli@FreeBSD.org>

in6_clsroute() was applied to prefix routes causing some
of them to expire. in6_clsroute() was only applied to
cloned routes that are no longer applicable after the
arp-v2 commit.


# fd14c50b 16-Dec-2008 Kip Macy <kmacy@FreeBSD.org>

- Simplify handling of the deferring of mbuf transmit until after lle lock drop
- add a couple of comments to clarify intent


# aba53ef0 15-Dec-2008 Kip Macy <kmacy@FreeBSD.org>

convert more pointer validation checks to checking against NULL


# 23ee1bfa 15-Dec-2008 Kip Macy <kmacy@FreeBSD.org>

explicitly check return of lla_lookup against NULL


# 15209fb6 15-Dec-2008 Kip Macy <kmacy@FreeBSD.org>

advance tail pointer in nd6_output_lle and check lla_output return against NULL


# 3d3728e9 15-Dec-2008 Qing Li <qingli@FreeBSD.org>

Initialize the variable "router", and apply "static_route" flag
across the entire nd6_cache_lladdr() function.


# c2b3a02b 15-Dec-2008 Kip Macy <kmacy@FreeBSD.org>

fix two use after frees in nd6_cache_lladdr caused by last minute unlock shuffling


# 6e6b3f7c 14-Dec-2008 Qing Li <qingli@FreeBSD.org>

The main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as many locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplifying the logic in the routing code

The most notable end result is the obsolescence of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improve and refactor the code, and
provided valuable reviews
- Julian Elischer set up the Perforce tree for me and has helped
me maintain that branch before the svn conversion


# 385195c0 10-Dec-2008 Marko Zec <zec@FreeBSD.org>

Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instantiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out. Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs. This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps. PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw. Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 4b79449e 02-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than using hidden includes (with circular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 44e33a07 19-Nov-2008 Marko Zec <zec@FreeBSD.org>

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instantiation,
assign initial values to them in initializer functions. As a rule,
initialization at instantiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentially, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methodology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to initialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 8b615593 02-Oct-2008 Marko Zec <zec@FreeBSD.org>

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparison between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 5ed3800e 19-Aug-2008 Julian Elischer <julian@FreeBSD.org>

Fix some of the formatting fixes. It's amazing how some things stand out
in a commit message.


# ac957cd2 19-Aug-2008 Julian Elischer <julian@FreeBSD.org>

A bunch of formatting fixes brought to light by, or created by the Vimage commit
a few days ago.


# 603724d3 17-Aug-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# 9233d8f3 08-Jan-2008 David E. O'Brien <obrien@FreeBSD.org>

un-__P()


# b48287a3 10-Dec-2007 David E. O'Brien <obrien@FreeBSD.org>

Clean up VCS Ids.


# b9b0dac3 28-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Move towards more explicit support for various network protocol stacks
in the TrustedBSD MAC Framework:

- Add mac_atalk.c and add explicit entry point mac_netatalk_aarp_send()
for AARP packet labeling, rather than using a generic link layer
entry point.

- Add mac_inet6.c and add explicit entry point mac_netinet6_nd6_send()
for ND6 packet labeling, rather than using a generic link layer entry
point.

- Add explicit entry point mac_netinet_arp_send() for ARP packet
labeling, and mac_netinet_igmp_send() for IGMP packet labeling,
rather than using a generic link layer entry point.

- Remove previous generic link layer entry point,
mac_mbuf_create_linklayer() as it is no longer used.

- Add implementations of new entry points to various policies, largely
by replicating the existing link layer entry point for them; remove
old link layer entry point implementation.

- Make MAC_IFNET_LOCK(), MAC_IFNET_UNLOCK(), and mac_ifnet_mtx global
to the MAC Framework rather than static to mac_net.c as it is now
needed outside of mac_net.c.

Obtained from: TrustedBSD Project


# 86407646 26-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Rename 'mac_mbuf_create_from_firewall' to 'mac_netinet_firewall_send' as
we move towards netinet as a pseudo-object for the MAC Framework.

Rename 'mac_create_mbuf_linklayer' to 'mac_mbuf_create_linklayer' to
reflect general object-first ordering preference.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 21b415b2 22-Oct-2007 John Baldwin <jhb@FreeBSD.org>

Close a race when trying to lookup a gateway route in rt_check().
Specifically, if two threads were doing concurrent lookups and the existing
gateway was marked down, the first thread would drop a reference on the
gateway route and then unlock the "root" route while it tried to allocate
a new route. The second thread could then also drop a reference on the
same gateway route resulting in a reference underflow. Fix this by
clearing the gateway route pointer after dropping the reference count but
before dropping the lock. Secondly, in this same case, the second thread
would overwrite the gateway route pointer w/o free'ing a reference to the
route installed by the first thread. In practice this would probably just
fix a lost reference that would result in a route never being freed.

This fixes panics observed in rt_check() and rtexpunge().

MFC after: 1 week
PR: kern/112490
Insight from: mehuljv at yahoo.com
Reviewed by: ru (found the "not-setting it to NULL" part)
Tested by: several


# 2a463222 05-Jul-2007 Xin LI <delphij@FreeBSD.org>

Space cleanup

Approved by: re (rwatson)


# 1272577e 05-Jul-2007 Xin LI <delphij@FreeBSD.org>

ANSIfy[1] plus some style cleanup nearby.

Discussed with: gnn, rwatson
Submitted by: Karl Sjödahl - dunceor <dunceor gmail com> [1]
Approved by: re (rwatson)


# edbb8b46 05-Jul-2007 Peter Wemm <peter@FreeBSD.org>

Fix 'assignment used as truth value' warning

Approved by: re (rwatson)


# b2630c29 02-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC
option is now deprecated, as well as the KAME IPsec code.
What was FAST_IPSEC is now IPSEC.

Approved by: re
Sponsored by: Secure Computing


# 2cb64cb2 01-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by: bz
Approved by: re
Supported by: Secure Computing


# d10f3ce0 21-May-2007 Olivier Houchard <cognet@FreeBSD.org>

Force the alignment of the char arrays, as they are cast later to
structs.
gcc 4.2 doesn't do it by default, and that results in unaligned access on
arm.

Reviewed by: gnn, imp


# 8f34a8b8 04-May-2007 SUZUKI Shinsuke <suz@FreeBSD.org>

some minor modification to the previous commit to sys/netinet6/nd6.c and nd6_nbr.c.
- added some clarification comments
- removed some unnecessary code

Obtained from: KAME
MFC after: 1 week


# 8d290a59 03-May-2007 SUZUKI Shinsuke <suz@FreeBSD.org>

fixed a memory leak in unresolved ND queue processing

Obtained from: KAME
MFC after: 1 week


# c57086ce 03-Feb-2007 Hajimu UMEMOTO <ume@FreeBSD.org>

ng_iface requires the neighbor cache as well.

MFC after: 3 days


# f234bea7 26-Jan-2007 Bruce A. Mah <bmah@FreeBSD.org>

Revert nd6.c revs. 1.67, 1.68, 1.69, 1.70 in an attempt to unbreak
IPv6 over point-to-point gif(4) tunnels.

These revisions caused a host route to the destination of a
point-to-point gif(4) interface to not get installed when the interface
and destination addresses were assigned. This caused
"no route to host" errors when trying to send traffic over the
interface. The first packet arriving inbound over the tunnel,
however, would cause the correct route to get installed, allowing
subsequent outbound traffic to be routed correctly.

gif(4) interfaces with prefix lengths of less than 128 bits
(i.e. no explicit destination address assigned) were not affected
by this bug.

This bug fix is a possible candidate for a 6.2-RELEASE errata note.

Approved by: jhay (original committer)
Discussed with: jhay, JINMEI Tatuya
MFC after: 3 days


# 1d54aa3b 11-Dec-2006 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4: 92972, 98913 + one more change

In ip6_sprintf no longer use and return one of eight static buffers
for printing/logging ipv6 addresses.
The caller now has to hand in a sufficiently large buffer as first
argument.
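
The caller-supplied-buffer pattern can be shown with a userland analogue.
This sketch does not reproduce the kernel's ip6_sprintf(); the helper
toy_ip6_sprintf() and its argument order are assumptions, and it simply
wraps the standard inet_ntop() call. What it illustrates is why the change
matters: each caller owns its buffer, so two formatted addresses can be
used at the same time without one overwriting the other.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Format an IPv6 address into a buffer provided by the caller.
 * Returns buf on success or NULL if the buffer is too small.
 * Callers should pass a buffer of at least INET6_ADDRSTRLEN bytes.
 */
static const char *
toy_ip6_sprintf(char *buf, size_t buflen, const struct in6_addr *addr)
{
	return (inet_ntop(AF_INET6, addr, buf, (socklen_t)buflen));
}
```

A typical caller declares "char ip6buf[INET6_ADDRSTRLEN];" on its own
stack and passes it in, which also removes the shared static state that
made the old interface unsafe under concurrency.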


# f9a047a1 25-Nov-2006 Ruslan Ermilov <ru@FreeBSD.org>

- In nd6_rtrequest(), when caching an rtentry, don't forget
to add a reference to it; otherwise, we could later access
a freed memory. This is believed to fix panics some users
were observing when running route6d(8), and is similar to
the fix in sys/netinet/if_ether.c,v 1.139 by glebius@.

PR: kern/93910, kern/105437
Testing by: Wojciech Puchar (still ongoing)

- Add rtentry locking to nd6_output() similar to rt_check().

MFC after: 4 days


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# ae0ddac7 02-Oct-2006 John Hay <jhay@FreeBSD.org>

Hopefully the last tweak in trying to make it possible to add ipv6 direct
host routes without side effects.

Submitted by: JINMEI Tatuya
MFC after: 4 days


# 584b68e7 30-Sep-2006 John Hay <jhay@FreeBSD.org>

A better fix is to check if it is a host route.

Submitted by: ume
MFC after: 5 days


# c482f11e 30-Sep-2006 John Hay <jhay@FreeBSD.org>

My previous commit broke "route add -inet6 <network_addr> -interface gif0".
Fix that by excluding point-to-point interfaces.

MFC after: 5 days


# f1298924 16-Sep-2006 John Hay <jhay@FreeBSD.org>

Make it possible to add an IPv6 host route to a host directly connected.

Use something like this:
route add -inet6 <dest_addr> <my_addr_on_that_interface> -interface -llinfo

This is useful for wireless adhoc mesh networks.

MFC after: 5 days


# 43bc7a9c 04-Aug-2006 Brooks Davis <brooks@FreeBSD.org>

With exception of the if_name() macro, all definitions in net_osdep.h
were unused or already in if_var.h so add if_name() to if_var.h and
remove net_osdep.h along with all references to it.

Longer term we may want to kill off if_name() entirely since all modern
BSDs have if_xname variables rendering it unnecessary.


# a59af512 07-Jun-2006 George V. Neville-Neil <gnn@FreeBSD.org>

Fix spurious warnings from neighbor discovery when working with IPv6 over
point to point tunnels (gif).

PR: 93220
Submitted by: Jinmei Tatuya
MFC after: 1 week


# 31d4137b 24-Mar-2006 SUZUKI Shinsuke <suz@FreeBSD.org>

fixed a memory leak when net.inet6.icmp6.nd6_maxqueuelen is greater than 1

Obtained from: KAME
MFC after: 3 days


# 43068328 12-Feb-2006 Hajimu UMEMOTO <ume@FreeBSD.org>

avoided the use of purged address structure when an address became
invalid in nd6_timer().

PR: kern/93170
Reported by: kris
Submitted by: JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp>
Confirmed by: kris
Obtained from: KAME
MFC after: 2 days


# 36dc24e6 21-Oct-2005 SUZUKI Shinsuke <suz@FreeBSD.org>

fixed a compilation failure on amd64/sparc64/ia64

Submitted by: max
MFC after: 2 month


# 743eee66 21-Oct-2005 SUZUKI Shinsuke <suz@FreeBSD.org>

sync with KAME regarding NDP

- introduced fine-grain-timer to manage ND-caches and IPv6 Multicast-Listeners
- supports Router-Preference <draft-ietf-ipv6-router-selection-07.txt>
- better prefix lifetime management
- more spec-conformant DAD advertisement
- updated RFC/internet-draft revisions

Obtained from: KAME
Reviewed by: ume, gnn
MFC after: 2 month


# 9c8aab3e 21-Oct-2005 SUZUKI Shinsuke <suz@FreeBSD.org>

perform NUD on an IPv6-aware point-to-point interface

Obtained from: KAME
MFC after: 1 week


# 7aa59493 19-Oct-2005 SUZUKI Shinsuke <suz@FreeBSD.org>

sync with KAME (nuked unused code, use NULL to denote a NULL pointer)

Obtained from: KAME
Reviewed by: ume, gnn


# 5b27b045 19-Oct-2005 SUZUKI Shinsuke <suz@FreeBSD.org>

supported an ndp command suboption to disable IPv6 in the given interface

Obtained from: KAME
Reviewed by: ume, gnn
MFC after: 2 week


# b9204379 19-Oct-2005 SUZUKI Shinsuke <suz@FreeBSD.org>

added an ioctl option in kernel so that ndp/rtadvd can change some NDP-related kernel variables based on their configurations (RFC2461 p.43 6.2.1 mandates this for IPv6 routers)

Obtained from: KAME
Reviewed by: ume, gnn
MFC after: 2 weeks


# 2ce62dce 19-Oct-2005 SUZUKI Shinsuke <suz@FreeBSD.org>

sync with KAME in the following points:
- fixed typos
- improved some comment descriptions
- use NULL, instead of 0, to denote a NULL pointer
- avoid embedding a magic number in the code
- use nd6log() instead of log() to record NDP-specific logs
- nuked an unnecessary white space

Obtained from: KAME
MFC after: 1 day


# 59280079 06-Sep-2005 Andrew Thompson <thompsa@FreeBSD.org>

Add support for multicast to the bridge and allow inet6 addresses to be
assigned to the interface.

IPv6 auto-configuration is disabled. An IPv6 link-local address has a
link-local scope within one link, the spec is unclear for the bridge case and
it may cause scope violation.

An address can be assigned in the usual way;
ifconfig bridge0 inet6 xxxx:...

Tested by: bmah
Reviewed by: ume (netinet6)
Approved by: mlaier (mentor)
MFC after: 1 week


# cd0fdcf7 12-Aug-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

- fix typo in comment.
- nuke unused code.

Submitted by: suz
Obtained from: KAME


# 530f95fc 11-Aug-2005 Gleb Smirnoff <glebius@FreeBSD.org>

o Make rt_check() function more strict:
- rt0 passed to rt_check() must not be NULL, assert this.
- rt returned by rt_check() must be valid locked rtentry,
if no error occurred.
o Modify callers, so that they never pass NULL rt0
to rt_check().

Reviewed by: sam, ume (nd6.c)


# 9bd8ca30 09-Aug-2005 Gleb Smirnoff <glebius@FreeBSD.org>

In preparation for fixing races in ARP (and probably in other
L2/L3 mappings) make rt_check() return a locked rtentry.


# 401df2f2 09-Aug-2005 Gleb Smirnoff <glebius@FreeBSD.org>

- Use 'error' variable to store error value, instead of 'i'.
- Push 'i' into the only block where it is used.
- Remove redundant check for rt being NULL. If rt_check() hasn't
returned an error, then rt is valid.

Reviewed by: gnn


# a1f7e5f8 24-Jul-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

scope cleanup. with this change
- most of the kernel code will not care about the actual encoding of
scope zone IDs and won't touch "s6_addr16[1]" directly.
- similarly, most of the kernel code will not care about link-local
scoped addresses as a special case.
- scope boundary check will be stricter. For example, the current
*BSD code allows a packet with src=::1 and dst=(some global IPv6
address) to be sent outside of the node, if the application does:
s = socket(AF_INET6);
bind(s, "::1");
sendto(s, some_global_IPv6_addr);
This is clearly wrong, since ::1 is only meaningful within a single
node, but the current implementation of the *BSD kernel cannot
reject this attempt.
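
The stricter boundary check described above can be pictured with a minimal sketch. The helper name loopback_scope_violation() is hypothetical and is not taken from the kernel; it only assumes the standard IN6_IS_ADDR_LOOPBACK() macro from <netinet/in.h> and covers just the ::1 case from the example.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the kernel's code: report whether sending
 * from src to dst would carry a loopback source address off the node.
 */
static bool
loopback_scope_violation(const struct in6_addr *src,
    const struct in6_addr *dst)
{
	assert(src != NULL && dst != NULL);
	/* ::1 is only meaningful within a single node. */
	return (IN6_IS_ADDR_LOOPBACK(src) && !IN6_IS_ADDR_LOOPBACK(dst));
}
```

With src set to ::1 and dst set to a global address the helper returns true, which corresponds to the send that the commit message says should be rejected.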

Submitted by: JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp>
Obtained from: KAME


# 9727df0c 20-Jul-2005 Hajimu UMEMOTO <ume@FreeBSD.org>

do not hardcode if_mtu values in here, except for IFT_{ARC,FDDI} -
they need special handling. makes it possible to take advantage of 9k ether
frames.

Obtained from: NetBSD


# a9771948 22-Feb-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Add CARP (Common Address Redundancy Protocol), which allows multiple
hosts to share an IP address, providing high availability and load
balancing.

Original work on CARP done by Michael Shalayeff, with many
additions by Marco Pfatschbacher and Ryan McBride.

FreeBSD port done solely by Max Laier.

Patch by: mlaier
Obtained from: OpenBSD (mickey, mcbride)


# caf43b02 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes, separate for KAME


# 77b691e0 02-Oct-2004 Brian Feldman <green@FreeBSD.org>

Prevent reentrancy of the IPv6 routing code (leading to crash with
INVARIANTS on, who knows what with it off).


# 690be704 05-Sep-2004 Robert Watson <rwatson@FreeBSD.org>

Call callout_init() on nd6_slowtimo_ch before setting it going; otherwise,
the flags field will be improperly initialized resulting in inconsistent
operation (sometimes with Giant, sometimes without, et al).

RELENG_5 candidate.


# c415679d 22-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Remove in6_prefix.[ch] and the contained router renumbering capability.
The prefix management code currently resides in nd6, leaving only the
unused router renumbering capability in the in6_prefix files. Removing
it will make it easier for us to provide locking for the remainder of
IPv6 by reducing the number of objects requiring synchronized access.

This functionality has also been removed from NetBSD and OpenBSD.

Submitted by: George Neville-Neil <gnn at neville-neil.com>
Discussed with/approved by: suz, keiichi at kame.net, core at kame.net


# 354c3d34 26-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

fix the change of interface in nd6_storelladdr for multicast
addresses too.

Reported by: Jun Kuriyama


# cd46a114 25-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

This commit does two things:

1. rt_check() cleanup:
rt_check() is only necessary for some address families to gain access
to the corresponding arp entry, so call it only in/near the *resolve()
routines where it is actually used -- at the moment this is
arpresolve(), nd6_storelladdr() (the call is embedded here),
and atmresolve() (the call is just before atmresolve to reduce
the number of changes).
This change will make it a lot easier to decouple the arp table
from the routing table.

There is an extra call to rt_check() in if_iso88025subr.c to
determine the routing info length. I have left it alone for
the time being.

The interface of arpresolve() and nd6_storelladdr() now changes slightly:
+ the 'rtentry' parameter (really a hint from the upper level layer)
is now passed unchanged from *_output(), so it becomes the route
to the final destination and not to the gateway.
+ the routines will return 0 if resolution is possible, non-zero
otherwise.
+ arpresolve() returns EWOULDBLOCK in case the mbuf is being held
waiting for an arp reply -- in this case the error code is masked
in the caller so the upper layer protocol will not see a failure.

2. arpcom untangling
Where possible, use 'struct ifnet' instead of 'struct arpcom' variables,
and use the IFP2AC macro to access arpcom fields.
This mostly affects the netatalk code.

=== Detailed changes: ===
net/if_arcsubr.c
rt_check() cleanup, remove a useless variable

net/if_atmsubr.c
rt_check() cleanup

net/if_ethersubr.c
rt_check() cleanup, arpcom untangling

net/if_fddisubr.c
rt_check() cleanup, arpcom untangling

net/if_iso88025subr.c
rt_check() cleanup

netatalk/aarp.c
arpcom untangling, remove a block of duplicated code

netatalk/at_extern.h
arpcom untangling

netinet/if_ether.c
rt_check() cleanup (change arpresolve)

netinet6/nd6.c
rt_check() cleanup (change nd6_storelladdr)


# 32404088 19-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

Remove a tail-recursive call in nd6_output.
This change is functionally identical to the original code, though
I have no idea if that was correct in the first place (see comment
in the commit).


# 056c7327 18-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

Replace Bcopy/Bzero with 'the real thing' as in the rest of the file.


# 328a0408 28-Jan-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

protect access to ifnet structure with mutex.


# 4208ea14 08-Dec-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- changed the logic in nd6_is_addr_neighbor(); check on-link prefixes
(not interface addresses) to see if a given address is on-link.
- skip offlink prefixes in neighbor determination in nd6_is_addr_neighbor.
- in nd6_is_addr_neighbor, regarded every address as on-link when the
default router list is empty. otherwise, we'd not be able to make a neighbor
cache for the address.
this algorithm is applied to hosts only.
- in nd6_is_addr_neighbor, check if the default interface is equal to
the interface in question in addition to check if the default router
list is empty.

Obtained from: KAME


# 7138d65c 08-Nov-2003 Sam Leffler <sam@FreeBSD.org>

replace explicit changes to rt_refcnt by RT_ADDREF and RT_REMREF
macros that expand to include assertions when the system is built
with INVARIANTS
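
As a rough userland illustration of the idea, reference-count macros that carry their own assertions might look like the sketch below. The _SKETCH names and struct rtentry_sketch are hypothetical, and assert() stands in for whatever assertion the kernel uses under INVARIANTS; the real macro bodies are not shown in this log.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's rtentry; only the count is modeled. */
struct rtentry_sketch {
	long rt_refcnt;
};

/* assert() plays the role of an INVARIANTS-only kernel assertion here. */
#define RT_ADDREF_SKETCH(rt) do {		\
	assert((rt)->rt_refcnt >= 0);		\
	(rt)->rt_refcnt++;			\
} while (0)

#define RT_REMREF_SKETCH(rt) do {		\
	assert((rt)->rt_refcnt > 0);		\
	(rt)->rt_refcnt--;			\
} while (0)
```

Routing every change through one pair of macros means an underflow is caught at the point where it happens rather than later, when the entry is freed.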

Supported by: FreeBSD Foundation


# 0f9ade71 04-Nov-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- cleanup SP refcnt issue.
- share policy-on-socket for listening socket.
- don't copy policy-on-socket at all. secpolicy no longer contains
spidx, which saves a lot of memory.
- deep-copy pcb policy if it is an ipsec policy. assign ID field to
all SPD entries. make it possible for racoon to grab SPD entry on
pcb.
- fixed the order of searching SA table for packets.
- fixed to get a security association header. a mode is always needed
to compare them.
- fixed that the incorrect time was set to
sadb_comb_{hard|soft}_usetime.
- disallow port spec for tunnel mode policy (as we don't reassemble).
- a user can define a policy-id.
- clear enc/auth key before freeing.
- fixed that the kernel crashed when key_spdacquire() was called
because key_spdacquire() had been implemented incompletely.
- preparation for 64bit sequence number.
- maintain ordered list of SA, based on SA id.
- cleanup secasvar management; refcnt is key.c responsibility;
alloc/free is keydb.c responsibility.
- cleanup, avoid double-loop.
- use hash for spi-based lookup.
- mark persistent SP "persistent".
XXX in theory refcnt should do the right thing, however, we have
"spdflush" which would touch all SPs. another solution would be to
de-register persistent SPs from sptree.
- u_short -> u_int16_t
- reduce kernel stack usage by auto variable secasindex.
- clarify function name confusion. ipsec_*_policy ->
ipsec_*_pcbpolicy.
- avoid variable name confusion.
(struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct
secpolicy *)
- count number of ipsec encapsulations on ipsec4_output, so that we
can tell ip_output() how to handle the packet further.
- When the value of the ul_proto is ICMP or ICMPV6, the port field in
"src" of the spidx specifies ICMP type, and the port field in "dst"
of the spidx specifies ICMP code.
- avoid applying IPsec transport mode to the packets when the
kernel forwards the packets.

Tested by: nork
Obtained from: KAME


# f95d4633 24-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

Switch Advanced Sockets API for IPv6 from RFC2292 to RFC3542
(aka RFC2292bis). Though I believe this commit doesn't break
backward compatibility against existing binaries, it breaks
backward compatibility of API.
Now, the applications which use Advanced Sockets API such as
telnet, ping6, mld6query and traceroute6 use RFC3542 API.

Obtained from: KAME


# 31b3783c 20-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

correct linkmtu handling.

Obtained from: KAME


# 2d0e1cf1 18-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

rtfree() must be called in lock context.

Reported by: jhay


# 31b1bfe1 17-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- add dom_if{attach,detach} framework.
- transition to use ifp->if_afdata.

Obtained from: KAME


# ba00f009 14-Oct-2003 Sam Leffler <sam@FreeBSD.org>

MFp4: correct locking issues in nd6_lookup

Supported by: FreeBSD Foundation


# 953ad2fb 10-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

nuke SCOPEDROUTING. Though it was there for a long time,
it was never enabled.


# 07eb2995 09-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- typo in comment
- style
- ANSIfy
(there is no functional change.)

Obtained from: KAME


# 40e39bbb 06-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

return(code) -> return (code)
(reduce diffs against KAME)


# d1dd20be 03-Oct-2003 Sam Leffler <sam@FreeBSD.org>

Locking for updates to routing table entries. Each rtentry gets a mutex
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.

Other/related changes:

o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts

Notes:

1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do not know about mutexes. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.

Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)


# 2049fdee 13-Sep-2003 Matthew N. Dodd <mdodd@FreeBSD.org>

Enable IPv6 for Token Ring.


# 07cf047d 05-Aug-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

introduced a flag bit "ND6_IFF_ACCEPT_RTADV" in the nd_ifinfo structure to
control whether to accept RAs per-interface basis.
the new stuff ensures the backward compatibility;
- the kernel does not accept RAs on any interfaces by default.
- since the default value of the flag bit is on, the kernel accepts RAs
on all interfaces when net.inet6.ip6.accept_rtadv is 1.
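
One way to read the two bullets above is that an RA is accepted only when both the global sysctl and the per-interface flag allow it. The sketch below encodes that reading; the flag value, struct nd_ifinfo_sketch, and ra_accepted() are all hypothetical and are not copied from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit value; the real constant lives in the kernel headers. */
#define SKETCH_ND6_IFF_ACCEPT_RTADV	0x1u

/* Hypothetical stand-in for the per-interface nd_ifinfo structure. */
struct nd_ifinfo_sketch {
	unsigned int flags;
};

/* Assumed rule: the global knob and the per-interface bit must both be on. */
static bool
ra_accepted(int global_accept_rtadv, const struct nd_ifinfo_sketch *ndi)
{
	assert(ndi != NULL);
	return (global_accept_rtadv != 0 &&
	    (ndi->flags & SKETCH_ND6_IFF_ACCEPT_RTADV) != 0);
}
```

Under this reading, a global value of 0 rejects RAs everywhere, and a global value of 1 accepts them on every interface whose flag is still at its default (set).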

Obtained from: KAME
MFC after: 1 week


# 77d43dae 29-Apr-2003 SUZUKI Shinsuke <suz@FreeBSD.org>

panic() doesn't need \n

Obtained from: KAME
MFC after: 2 days


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 8deebb01 02-Feb-2003 Alfred Perlstein <alfred@FreeBSD.org>

Consolidate MIN/MAX macros into one place (param.h).

Submitted by: Hiten Pandya <hiten@unixdaemons.com>


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 4b32dfdc 02-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

When generating nd6 output on an interface, label the packet
appropriately.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 83649f07 24-Apr-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

Correct timer management (deprecated) in nd6_timer.

Obtained from: KAME
MFC after: 3 days


# 88ff5695 18-Apr-2002 SUZUKI Shinsuke <suz@FreeBSD.org>

just merged cosmetic changes from KAME to ease sync between KAME and FreeBSD.
(based on freebsd4-snap-20020128)

Reviewed by: ume
MFC after: 1 week


# 8e5fd461 05-Apr-2002 Matthew N. Dodd <mdodd@FreeBSD.org>

Use <net/fddi.h> rather than <netinet/if_fddi.h>.


# 3c404a32 01-Apr-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

In nd6_lookup(), check if rt_llinfo is non-NULL to avoid returning an
entry that has the LLINFO flag but is not a neighbor cache entry.

Obtained from: KAME
MFC after: 1 week


# c3cf07a1 28-Feb-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

- In nd6_rtrequest(), ignored a route when it is created by cloning and
is not a neighbor. see comments for the detailed reason.

- Rejected the process of nd6_rtrequest() when the request is RESOLVE and
the interface does not need neighbor caches.

Obtained from: KAME
MFC After: 1 week


# 8071913d 17-Oct-2001 Ruslan Ermilov <ru@FreeBSD.org>

Pull post-4.4BSD change to sys/net/route.c from BSD/OS 4.2.

Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *''
as the argument. Pass rt_addrinfo all the way down to rtrequest1
and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now
``rt_addrinfo *'' instead of ``sockaddr *'' (almost no one is
using it anyways).

Benefit: the following command now works. Previously we needed
two route(8) invocations, "add" then "change".
# route add -inet6 default ::1 -ifp gif0

Remove unsafe typecast in rtrequest(), from ``rtentry *'' to
``sockaddr *''. It was introduced by 4.3BSD-Reno and never
corrected.

Obtained from: BSD/OS, NetBSD
MFC after: 1 month
PR: kern/28360


# f9132ceb 05-Sep-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Wrap array accesses in macros, which also happen to be lvalues:

ifnet_addrs[i - 1] -> ifaddr_byindex(i)
ifindex2ifnet[i] -> ifnet_byindex(i)

This is intended to ease the conversion to SMPng.
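
A minimal sketch of an accessor macro that is also an lvalue, assuming a plain array behind it. The _sketch names and the table size are hypothetical; the real definitions are not shown in this log.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ifnet. */
struct ifnet_sketch {
	int if_index;
};

#define SKETCH_MAXIF	8
static struct ifnet_sketch *ifindex2ifnet_sketch[SKETCH_MAXIF];

/* Expands to an array element, so it can be read or assigned to. */
#define ifnet_byindex_sketch(i)	(ifindex2ifnet_sketch[(i)])
```

Because the macro expands to the element itself, existing call sites that assign through the array keep working, while the storage behind the macro can later be changed in one place.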


# 554bf4aa 04-Jul-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

When the link-layer address of a router changes, select the
best router again. In particular, when the neighbor entry is newly
created, it might affect the selection policy.

Obtained from: KAME
MFC after: 1 week


# 1026ccc4 27-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

refresh default router list on nd6_purge(), only if we are an
autoconfigured host.

Obtained from: KAME


# 05b6760d 19-Jun-2001 Munechika SUMIKAWA <sumikawa@FreeBSD.org>

Add IFT_L2VLAN for supported NDP type. IPv6 over VLAN works now.

Obtained from: KAME
MFC after: 2 weeks


# 33841545 10-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problems after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# fef5fd23 10-Mar-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Plug several mbuf leaks in error cases (in nd6)

Submitted by: jhay


# 19391949 25-Feb-2001 Kris Kennaway <kris@FreeBSD.org>

More IP option length validation.
Includes the following revisions from KAME (two of these were actually
committed previously but the CVS revisions weren't documented):

1.40 kame/kame/sys/netinet6/ah_core.c (committed in previous rev)
1.41 kame/kame/sys/netinet6/ah_core.c
1.28 kame/kame/sys/netinet6/ah_output.c (committed in previous rev)
1.29 kame/kame/sys/netinet6/ah_output.c
1.30 kame/kame/sys/netinet6/ah_output.c
1.129 kame/kame/sys/netinet6/nd6.c
1.130 kame/kame/sys/netinet6/nd6.c
1.24 kame/kame/sys/netinet6/dest6.c
1.25 kame/kame/sys/netinet6/dest6.c

Obtained from: KAME


# bf1c6fef 20-Feb-2001 Hidetoshi Shimokawa <simokawa@FreeBSD.org>

Better detection of duplicated initialization.

Obtained from: KAME


# 0634b4a7 29-Jan-2001 Peter Wemm <peter@FreeBSD.org>

Yikes, these files bogusly #include "loop.h" but didn't use the value.
My searching for NLOOP missed them. :-(


# 686cdd19 04-Jul-2000 Jun-ichiro itojun Hagino <itojun@FreeBSD.org>

sync with kame tree as of july00. tons of bug fixes/improvements.

API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
(also syntax change)


# ad8d5711 17-Apr-2000 Munechika SUMIKAWA <sumikawa@FreeBSD.org>

even if nd6_nud_hint is called, do not change a neighbor's status
unless the old status is probably reachable (i.e. the link-layer address
has already been resolved).

Obtained from: KAME Project


# cfa1ca9d 07-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

udp IPv6 support, IPv6/IPv4 tunneling support in kernel,
packet divert at kernel for IPv6/IPv4 translator daemon

This includes queue related patch submitted by jburkhol@home.com.

Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 82cd038d 21-Nov-1999 Yoshinobu Inoue <shin@FreeBSD.org>

KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assign IPv6 addr automatically, and can reply to
IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project