History log of /freebsd-11-stable/sys/net/route.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 359652 06-Apr-2020 hselasky

MFC r333806:
Use NULL for SYSINIT's last arg, which is a pointer type

Sponsored by: The FreeBSD Foundation


# 350865 11-Aug-2019 gnn

MFC: 350557

Properly validate arguments for route deletion

Reported by: Liang Zhuo brightiup.zhuo@gmail.com


# 331722 29-Mar-2018 eadler

Revert r330897:

This was intended to be a non-functional change. It wasn't. The commit
message was thus wrong. In addition it broke arm, and merged crypto
related code.

Revert with prejudice.

This revert skips files touched in r316370 since that commit was since
MFCed. This revert also skips files that require $FreeBSD$ property
changes.

Thank you to those who helped me get out of this mess including but not
limited to gonzo, kevans, rgrimes.

Requested by: gjb (re)


# 330897 14-Mar-2018 eadler

Partial merge of the SPDX changes

These changes are incomplete but are making it difficult
to determine what other changes can/should be merged.

No objections from: pfg


# 320134 20-Jun-2017 ae

MFC r319895:
Resurrect RTF_RNH_LOCKED flag and restore ability to call rtalloc1_fib()
with acquired RIB lock.

This fixes a possible panic due to trying to acquire RIB rlock when it is
already exclusive locked.

PR: 215963, 215122
Sponsored by: Yandex LLC
Approved by: re (delphij)


# 310884 31-Dec-2016 loos

MFC r309717:

Fix the typos and style(9) in comment.


# 302408 07-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


/freebsd-11-stable/MAINTAINERS
/freebsd-11-stable/cddl
/freebsd-11-stable/cddl/contrib/opensolaris
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/print
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs
/freebsd-11-stable/contrib/amd
/freebsd-11-stable/contrib/apr
/freebsd-11-stable/contrib/apr-util
/freebsd-11-stable/contrib/atf
/freebsd-11-stable/contrib/binutils
/freebsd-11-stable/contrib/bmake
/freebsd-11-stable/contrib/byacc
/freebsd-11-stable/contrib/bzip2
/freebsd-11-stable/contrib/com_err
/freebsd-11-stable/contrib/compiler-rt
/freebsd-11-stable/contrib/dialog
/freebsd-11-stable/contrib/dma
/freebsd-11-stable/contrib/dtc
/freebsd-11-stable/contrib/ee
/freebsd-11-stable/contrib/elftoolchain
/freebsd-11-stable/contrib/elftoolchain/ar
/freebsd-11-stable/contrib/elftoolchain/brandelf
/freebsd-11-stable/contrib/elftoolchain/elfdump
/freebsd-11-stable/contrib/expat
/freebsd-11-stable/contrib/file
/freebsd-11-stable/contrib/gcc
/freebsd-11-stable/contrib/gcclibs/libgomp
/freebsd-11-stable/contrib/gdb
/freebsd-11-stable/contrib/gdtoa
/freebsd-11-stable/contrib/groff
/freebsd-11-stable/contrib/ipfilter
/freebsd-11-stable/contrib/ldns
/freebsd-11-stable/contrib/ldns-host
/freebsd-11-stable/contrib/less
/freebsd-11-stable/contrib/libarchive
/freebsd-11-stable/contrib/libarchive/cpio
/freebsd-11-stable/contrib/libarchive/libarchive
/freebsd-11-stable/contrib/libarchive/libarchive_fe
/freebsd-11-stable/contrib/libarchive/tar
/freebsd-11-stable/contrib/libc++
/freebsd-11-stable/contrib/libc-vis
/freebsd-11-stable/contrib/libcxxrt
/freebsd-11-stable/contrib/libexecinfo
/freebsd-11-stable/contrib/libpcap
/freebsd-11-stable/contrib/libstdc++
/freebsd-11-stable/contrib/libucl
/freebsd-11-stable/contrib/libxo
/freebsd-11-stable/contrib/llvm
/freebsd-11-stable/contrib/llvm/projects/libunwind
/freebsd-11-stable/contrib/llvm/tools/clang
/freebsd-11-stable/contrib/llvm/tools/lldb
/freebsd-11-stable/contrib/llvm/tools/llvm-dwarfdump
/freebsd-11-stable/contrib/llvm/tools/llvm-lto
/freebsd-11-stable/contrib/mdocml
/freebsd-11-stable/contrib/mtree
/freebsd-11-stable/contrib/ncurses
/freebsd-11-stable/contrib/netcat
/freebsd-11-stable/contrib/ntp
/freebsd-11-stable/contrib/nvi
/freebsd-11-stable/contrib/one-true-awk
/freebsd-11-stable/contrib/openbsm
/freebsd-11-stable/contrib/openpam
/freebsd-11-stable/contrib/openresolv
/freebsd-11-stable/contrib/pf
/freebsd-11-stable/contrib/sendmail
/freebsd-11-stable/contrib/serf
/freebsd-11-stable/contrib/sqlite3
/freebsd-11-stable/contrib/subversion
/freebsd-11-stable/contrib/tcpdump
/freebsd-11-stable/contrib/tcsh
/freebsd-11-stable/contrib/tnftp
/freebsd-11-stable/contrib/top
/freebsd-11-stable/contrib/top/install-sh
/freebsd-11-stable/contrib/tzcode/stdtime
/freebsd-11-stable/contrib/tzcode/zic
/freebsd-11-stable/contrib/tzdata
/freebsd-11-stable/contrib/unbound
/freebsd-11-stable/contrib/vis
/freebsd-11-stable/contrib/wpa
/freebsd-11-stable/contrib/xz
/freebsd-11-stable/crypto/heimdal
/freebsd-11-stable/crypto/openssh
/freebsd-11-stable/crypto/openssl
/freebsd-11-stable/gnu/lib
/freebsd-11-stable/gnu/usr.bin/binutils
/freebsd-11-stable/gnu/usr.bin/cc/cc_tools
/freebsd-11-stable/gnu/usr.bin/gdb
/freebsd-11-stable/lib/libc/locale/ascii.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris
/freebsd-11-stable/sys/contrib/dev/acpica
/freebsd-11-stable/sys/contrib/ipfilter
/freebsd-11-stable/sys/contrib/libfdt
/freebsd-11-stable/sys/contrib/octeon-sdk
/freebsd-11-stable/sys/contrib/x86emu
/freebsd-11-stable/sys/contrib/xz-embedded
/freebsd-11-stable/usr.sbin/bhyve/atkbdc.h
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.c
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.h
/freebsd-11-stable/usr.sbin/bhyve/console.c
/freebsd-11-stable/usr.sbin/bhyve/console.h
/freebsd-11-stable/usr.sbin/bhyve/pci_fbuf.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.h
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.c
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.h
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.c
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.h
/freebsd-11-stable/usr.sbin/bhyve/rfb.c
/freebsd-11-stable/usr.sbin/bhyve/rfb.h
/freebsd-11-stable/usr.sbin/bhyve/sockstream.c
/freebsd-11-stable/usr.sbin/bhyve/sockstream.h
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.c
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.h
/freebsd-11-stable/usr.sbin/bhyve/usb_mouse.c
/freebsd-11-stable/usr.sbin/bhyve/vga.c
/freebsd-11-stable/usr.sbin/bhyve/vga.h
# 302054 21-Jun-2016 bz

Get closer to a VIMAGE network stack teardown from top to bottom rather
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.

Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.

Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.

For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.

Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.

For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).

Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.

Approved by: re (hrs)
Obtained from: projects/vnet
Reviewed by: gnn, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6747


# 301502 06-Jun-2016 bz

Provide a public interface to rt_flushifroutes which takes the address
family as an argument as well.
This will be used to cleanup individual protocols during VNET teardown.

Obtained from: projects/vnet
Sponsored by: The FreeBSD Foundation


# 301217 02-Jun-2016 gnn

This change re-adds L2 caching for TCP and UDP, as originally added in D4306
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.

Submitted by: Mike Karels
Differential Revision: https://reviews.freebsd.org/D6262


# 297234 24-Mar-2016 bz

Fix compile errors after r297225:

- properly V_irtualise variable access unbreaking VIMAGE kernels.
- remove the volatile from the function return type to make architecture
using gcc happy [-Wreturn-type]
"type qualifiers ignored on function return type"
I am not entirely happy with this solution putting the u_int there
but it will do for now.


# 297225 24-Mar-2016 gnn

FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.

Submitted by: Mike Karels
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4306


# 295529 11-Feb-2016 dteske

Merge SVN r295220 (bz) from projects/vnet/

Fix a panic that occurs when a vnet interface is unavailable at the time the
vnet jail referencing said interface is stopped.

Sponsored by: FIS Global, Inc.


# 294710 25-Jan-2016 melifaro

Fix flowtable part missed in r294706.


# 294706 25-Jan-2016 melifaro

MFP r287070,r287073: split radix implementation and route table structure.

There are number of radix consumers in kernel land (pf,ipfw,nfs,route)
with different requirements. In fact, first 3 don't have _any_ requirements
and first 2 does not use radix locking. On the other hand, routing
structure do have these requirements (rnh_gen, multipath, custom
to-be-added control plane functions, different locking).
Additionally, radix should not known anything about its consumers internals.

So, radix code now uses tiny 'struct radix_head' structure along with
internal 'struct radix_mask_head' instead of 'struct radix_node_head'.
Existing consumers still uses the same 'struct radix_node_head' with
slight modifications: they need to pass pointer to (embedded)
'struct radix_head' to all radix callbacks.

Routing code now uses new 'struct rib_head' with different locking macro:
RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing
information base).

New net/route_var.h header was added to hold routing subsystem internal
data. 'struct rib_head' was placed there. 'struct rtentry' will also
be moved there soon.


# 294020 14-Jan-2016 melifaro

Fix panic in IP redirect. Panic was introduced in r293466.

Found by: Yamagi Burmeister <lists at yamagi.org>>


# 293886 14-Jan-2016 melifaro

Remove now-unused wrappers for various routing functions.


# 293829 13-Jan-2016 melifaro

Remove RTF_RNH_LOCKED support from rtalloc1_fib().

Last caller using it was eliminated in r293471.

Sponsored by: Yandex LLC


# 293466 09-Jan-2016 melifaro

(Temporarily) remove route_redirect_event eventhandler.

Such handler should pass different set of variables, instead
of directly providing 2 locked route entries.
Given that it hasn't been really used since at least 2012, remove
current code.
Will re-add it after finishing most major routing-related changes.

Discussed with: np


# 293465 09-Jan-2016 melifaro

Please Coverity by removing unneccessary check (rt_key() is always set).

Coverity CID: 1347797


# 293424 08-Jan-2016 melifaro

Do more fine-grained locking in rtrequest1_fib().

Last consumer using RTF_RNH_LOCKED flag was eliminated in r291643.
Restrict passing RTF_RNH_LOCKED to rtrequest1_fib() and do better
locking for RTM_ADD / RTM_DELETE cases.


# 293159 04-Jan-2016 melifaro

Add rib_lookup_info() to provide API for retrieving individual route
entries data in unified format.

There are control plane functions that require information other than
just next-hop data (e.g. individual rtentry fields like flags or
prefix/mask). Given that the goal is to avoid rte reference/refcounting,
re-use rt_addrinfo structure to store most rte fields. If caller wants
to retrieve key/mask or gateway (which are sockaddrs and are allocated
separately), it needs to provide sufficient-sized sockaddrs structures
w/ ther pointers saved in passed rt_addrinfo.

Convert:
* lltable new records checks (in_lltable_rtcheck(),
nd6_is_new_addr_neighbor().
* rtsock pre-add/change route check.
* IPv6 NS ND-proxy check (RADIX_MPATH code was eliminated because
1) we don't support RTF_ANNOUNCE ND-proxy for networks and there should
not be multiple host routes for such hosts 2) if we have multiple
routes we should inspect them (which is not done). 3) the entire idea
of abusing KRT as storage for ND proxy seems odd. Userland programs
should be used for that purpose).


# 292163 13-Dec-2015 melifaro

Fix PINNED routes handling.
Before r291643, adding new interface prefix had the following logic:
try_add:
EEXIST && (PINNED) {
try_del(w/o PINNED flag)
if (OK)
try_add(PINNED)
}

In r291643, deletion was performed w/ PINNED flag held which leaded
to new interface prefixes (like ::1) overriding older ones.
Fix this by requesting deletion w/o RTF_PINNED.

PR: kern/205285
Submitted by: Fabian Keil <fk at fabiankeil.de>


# 291643 02-Dec-2015 melifaro

Move RTF_PINNED handling to generic route code.
This eliminates last RTF_RNH_LOCKED rtrequest1_fib() user.


# 291565 01-Dec-2015 ngie

Fix LINT-NOIP kernels after r291467

rn is only used if INET or INET6 are defined

Sponsored by: EMC / Isilon Storage Division


# 291467 30-Nov-2015 melifaro

Move flowtable rte checks to separate function.


# 291466 30-Nov-2015 melifaro

Add new rt_foreach_fib_walk_del() function for deleting route entries
by filter function instead of picking into routing table details in
each consumer.
Remove now-unused rt_expunge() (eliminating last external RTF_RNH_LOCKED
user).
This simplifies future nexthops/mulitipath changes and rtrequest1_fib()
locking refactoring.

Actual changes:
Add "rt_chain" field to permit rte grouping while doing batched delete
from routing table (thus growing rte 200->208 on amd64).
Add "rti_filter" / "rti_filterdata" / "rti_spare" fields to rt_addrinfo
to pass filter function to various routing subsystems in standard way.
Convert all rt_expunge() customers to new rt_addinfo-based api and eliminate
rt_expunge().


# 290828 14-Nov-2015 melifaro

Pass provided af instead of AF_UNSPEC to setwa_f callback.


# 290154 29-Oct-2015 bdrewery

Avoid passing an uninitialized 'i'. Currently nothing was depending on it
anyhow.

Coverity CID: 1331562


# 289461 17-Oct-2015 melifaro

Remove several compat functions from pre-fib era.


# 287798 14-Sep-2015 vangyzen

Fix the handling of IPv6 On-Link Redirects.

On receipt of a redirect message, install an interface route for the
redirected destination. On removal of the corresponding Neighbor Cache
entry, remove the interface route.

This requires changes in rtredirect_fib() to cope with an AF_LINK
address for the gateway and with the absence of RTF_GATEWAY.

This fixes the "Redirected On-Link" test cases in the Tahi IPv6 Ready Logo
Phase 2 test suite.

Unrelated to the above, fix a recursion on the radix node head lock
triggered by the Tahi Redirected to Alternate Router test cases.

When I first wrote this patch in October 2012, all Section 2
(Neighbor Discovery) test cases passed on 10-CURRENT, 9-STABLE,
and 8-STABLE. cem@ recently rebased the 10.x patch onto head and reported
that it passes Tahi. (Thanks!)

These other test cases also passed in 2012:

* the RTF_MODIFIED case, with IPv4 and IPv6 (using a
RTF_HOST|RTF_GATEWAY route for the destination)

* the redirected-to-self case, with IPv4 and IPv6

* a valid IPv4 redirect

All testing in 2012 was done with WITNESS and INVARIANTS.

Tested by: EMC / Isilon Storage Division via Conrad Meyer (cem) in 2015,
Mark Kelley <mark_kelley@dell.com> in 2012,
TC Telkamp <terence_telkamp@dell.com> in 2012
PR: 152791
Reviewed by: melifaro (current rev), bz (earlier rev)
Approved by: kib (mentor)
MFC after: 1 month
Relnotes: yes
Sponsored by: Dell Inc.
Differential Revision: https://reviews.freebsd.org/D3602


# 287476 05-Sep-2015 melifaro

Constantify lookup key in ifa_ifwith* functions.
Some places in our network stack already have const
arguments (like if_output() routines and LLE functions).

Code using ifa_ifwith (and similar functins) along with
LLE/_output functions is currently bound to use tricks
like __DECONST(). Provide a cleaner way by making sockaddr
lookup key really constant.

MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3464


# 286594 10-Aug-2015 melifaro

Rename rt_foreach_fib() to rt_foreach_fib_walk().

Suggested by: julian


# 286458 08-Aug-2015 melifaro

MFP r274295:

* Move interface route cleanup to route.c:rt_flushifroutes()
* Convert most of "for (fibnum = 0; fibnum < rt_numfibs; fibnum++)" users
to use new rt_foreach_fib() instead of hand-rolling cycles.


# 286057 30-Jul-2015 loos

Follow r256586 and rename the kernel version of the Free() macro to
R_Free(). This matches the other macros and reduces the chances to clash
with other headers.

This also fixes the build of radix.c outside of the kernel environment.

Reviewed by: glebius


# 281583 16-Apr-2015 araujo

Remove duplicate header entry.


# 274611 16-Nov-2014 melifaro

Finish r274175: do control plane MTU tracking.

Update route MTU in case of ifnet MTU change.
Add new RTF_FIXEDMTU to track explicitly specified MTU.

Old behavior:
ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU.
User has to manually update all routes.
ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU.
However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu
gets updated.

New behavior:
ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs gets updated
with new MTU unless RTF_FIXEDMTU flag set on them.
ifconfig em0 mtu 9000->1500 -> all routes in all fibs gets updated with new
MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu.

route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag.
route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag.

PR: 194238
MFC after: 1 month
CR: D1125


# 274589 16-Nov-2014 melifaro

Revert r274585: rte lock is properly destroyed in uma dtor callback.

Pointed by: glebius


# 274585 16-Nov-2014 melifaro

Make witness happy: destroy rte lock before free.

MFC after: 2 weeks


# 274187 06-Nov-2014 melifaro

Fix build.

Pointy hat to: melifaro


# 274177 06-Nov-2014 melifaro

Finish r274118: remove useless fields from struct domain.

Sponsored by: Yandex LLC


# 274175 06-Nov-2014 melifaro

Make checks for rt_mtu generic:

Some virtual if drivers has (ab)used ifa ifa_rtrequest hook to enforce
route MTU to be not bigger that interface MTU. While ifa_rtrequest hooking
might be an option in some situation, it is not feasible to do MTU checks
there: generic (or per-domain) routing code is perfectly capable of doing
this.

We currrently have 3 places where MTU is altered:

1) route addition.
In this case domain overrides radix _addroute callback (in[6]_addroute)
and all necessary checks/fixes are/can be done there.

2) route change (especially, GW change).
In this case, there are no explicit per-domain calls, but one can
override rte by setting ifa_rtrequest hook to domain handler
(inet6 does this).

3) ifconfig ifaceX mtu YYYY
In this case, we have no callbacks, but ip[6]_output performes runtime
checks and decreases rt_mtu if necessary.

Generally, the goals are to be able to handle all MTU changes in
control plane, not in runtime part, and properly deal with increased
interface MTU.

This commit changes the following:
* removes hooks setting MTU from drivers side
* adds proper per-doman MTU checks for case 1)
* adds generic MTU check for case 2)

* The latter is done by using new dom_ifmtu callback since
if_mtu denotes L3 interface MTU, e.g. maximum trasmitted _packet_ size.
However, IPv6 mtu might be different from if_mtu one (e.g. default 1280)
for some cases, so we need an abstract way to know maximum MTU size
for given interface and domain.
* moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies
user-supplied data which must be checked.
* removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to
use this functions on new non-inserted rte.

More changes will follow soon.

MFC after: 1 month
Sponsored by: Yandex LLC


# 271916 21-Sep-2014 hrs

Make net.add_addr_allfibs vnet-local.


# 271438 11-Sep-2014 asomers

Revisions 264905 and 266860 added a "int fib" argument to ifa_ifwithnet and
ifa_ifwithdstaddr. For the sake of backwards compatibility, the new
arguments were added to new functions named ifa_ifwithnet_fib and
ifa_ifwithdstaddr_fib, while the old functions became wrappers around the
new ones that passed RT_ALL_FIBS for the fib argument. However, the
backwards compatibility is not desired for FreeBSD 11, because there are
numerous other incompatible changes to the ifnet(9) API. We therefore
decided to remove it from head but leave it in place for stable/9 and
stable/10. In addition, this commit adds the fib argument to
ifa_ifwithbroadaddr for consistency's sake.

sys/sys/param.h
Increment __FreeBSD_version

sys/net/if.c
sys/net/if_var.h
sys/net/route.c
Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib
versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute.

sys/net/route.c
sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_options.c
sys/netinet/ip_output.c
sys/netinet6/nd6.c
Fixup calls of modified functions.

share/man/man9/ifnet.9
Document changed API.

CR: https://reviews.freebsd.org/D458
MFC after: Never
Sponsored by: Spectra Logic


# 267992 28-Jun-2014 hselasky

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 267985 27-Jun-2014 gjb

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 267961 27-Jun-2014 hselasky

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# 266860 29-May-2014 asomers

Fix unintended KBI change from r264905. Add _fib versions of
ifa_ifwithnet() and ifa_ifwithdstaddr() The legacy functions will call the
_fib() versions with RT_ALL_FIBS, preserving legacy behavior.

sys/net/if_var.h
sys/net/if.c
Add legacy-compatible functions as described above. Ensure legacy
behavior when RT_ALL_FIBS is passed as fibnum.

sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/net/route.c
sys/net/rtsock.c
sys/netinet6/nd6.c
Call with _fib() functions if we must use a specific fib, or the
legacy functions otherwise.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
Improve the udp_dontroute test. The bug that this test exercises is
that ifa_ifwithnet() will return the wrong address, if multiple
interfaces have addresses on the same subnet but with different
fibs. The previous version of the test only considered one possible
failure mode: that ifa_ifwithnet_fib() might fail to find any
suitable address at all. The new version also checks whether
ifa_ifwithnet_fib() finds the correct address by checking where the
ARP request goes.

Reported by: bz, hrs
Reviewed by: hrs
MFC after: 1 week
X-MFC-with: 264905
Sponsored by: Spectra Logic


# 265280 03-May-2014 melifaro

Remove additional fib checks from rtalloc1_fib.
It looks like current consumers are either unaware
of MRT (and uses RT_DEFAULT_FIB implicitly) or
know what thay are doing, In latter case they
will be either hit by KASSERT or ESCRH will be returned
due to NULL rnh.


# 265279 03-May-2014 melifaro

Pass radix head ptr along with rte to rtexpunge().
Rename rtexpunge to rt_expunge().


# 265103 29-Apr-2014 melifaro

Move rt_setmetrics() from rtsock.c to route.c.
All rtsock-initiated rte creation/modification are now
performed in route.c holding radix tree write lock.
This reduces the need for per-rte mutex.

Sponsored by: Yandex LLC
MFC after: 1 month


# 265091 29-Apr-2014 melifaro

Do not use senderr() in rtrequest1_fib_change().

Suggested by: glebius
MFC after: 4 weeks


# 264989 26-Apr-2014 melifaro

Remove useless `register' declarations.

MFC after: 1 month


# 264986 26-Apr-2014 melifaro

Decouple RTM_CHANGE from RTM_GET handling in rtsock.c:route_output().
RTM_CHANGE is now handled inside route.c:rtrequest1_fib() as it should be.
Note change change handler is a separate function rtrequest1_fib_change().

MFC after: 1 month


# 264973 26-Apr-2014 melifaro

Unify sa_equal() macro usage.

MFC after: 2 weeks


# 264905 24-Apr-2014 asomers

Fix subnet and default routes on different FIBs on the same subnet.

These two bugs are closely related. The root cause is that ifa_ifwithnet
does not consider FIBs when searching for an interface address.

sys/net/if_var.h
sys/net/if.c
Add a fib argument to ifa_ifwithnet and ifa_ifwithdstadddr. Those
functions will only return an address whose interface fib equals the
argument.

sys/net/route.c
Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib
arguments.

sys/netinet/in.c
Update in_addprefix to consider the interface fib when adding
prefixes. This will prevent it from not adding a subnet route when
one already exists on a different fib.

sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/netinet6/nd6.c
Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet.
In some cases it there wasn't a clear specific fib number to use.
In others, I was unable to test those functions so I chose
RT_DEFAULT_FIB to minimize divergence from current behavior. I will
fix some of the latter changes along with PR kern/187553.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
tests/sys/netinet/Makefile
Revert r263738. The udp_dontroute test was right all along.
However, bugs kern/187550 and kern/187553 cancelled each other out
when it came to this test. Because of kern/187553, ifa_ifwithnet
searched the default fib instead of the requested one, but because
of kern/187550, there was an applicable subnet route on the default
fib. The new test added in r263738 doesn't work right, however. I
can verify with dtrace that ifa_ifwithnet returned the wrong address
before I applied this commit, but route(8) miraculously found the
correct interface to use anyway. I don't know how.

Clear expected failure messages for kern/187550 and kern/187552.

PR: kern/187550
PR: kern/187552
Reviewed by: melifaro
MFC after: 3 weeks
Sponsored by: Spectra Logic


# 264887 24-Apr-2014 asomers

Fix host and network routes for new interfaces when net.add_addr_allfibs=0

sys/net/route.c
In rtinit1, use the interface fib instead of the process fib. The
latter wasn't very useful because ifconfig(8) is usually invoked
with the default process fib. Changing ifconfig(8) to use setfib(2)
would be redundant, because it already sets the interface fib.

tests/sys/netinet/fibs_test.sh
Clear the expected ATF failure

sys/net/if.c
Pass the interface fib in calls to rtrequest1_fib and rtalloc1_fib

sys/netinet/in.c
sys/net/if_var.h
Add a fibnum argument to ifa_switch_loopback_route, a subroutine of
in_scrubprefix. Pass it the interface fib.

PR: kern/187549
Reviewed by: melifaro
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation


# 264241 07-Apr-2014 tuexen

Call sctp_addr_change() from rt_addrmsg() instead of rt_newaddrmsg_fib(),
since rt_addrmsg() gets also called from other functions.

MFC after: 3 days


# 263203 15-Mar-2014 glebius

Garbage collect long time obsoleted (or never used) stuff from routing API.


# 262806 05-Mar-2014 glebius

The route code used to mtx_destroy() a locked mutex before rtentry free. Now,
after r262763 it started to return locked mutexes to UMA. To fix that,
conditionally unlock the mutex in the destructor.

Tested by: "Sergey V. Dyatko" <sergey.dyatko@gmail.com>


# 262763 04-Mar-2014 glebius

- Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with: melifaro
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 262758 04-Mar-2014 gnn

Revert previous commit (262727) and bounce patch back to the
submitter.

Pointed out by: jhb


# 262727 04-Mar-2014 gnn

Naming consistency fix. The routing code defines
RADIX_NODE_HEAD_LOCK as grabbing the write lock,
but RADIX_NODE_HEAD_LOCK_ASSERT as checking the read lock.

Submitted by: Vijay Singh <vijju.singh at gmail.com>
MFC after: 1 month


# 261601 07-Feb-2014 glebius

o Revamp API between flowtable and netinet, netinet6.
- ip_output() and ip_output6() simply call flowtable_lookup(),
passing mbuf and address family. That's the only code under
#ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
- Remove hand made pcpu stats, and utilize counter(9).
- Snapshot of statistics is available via 'netstat -rs'.
- All sysctls are moved into net.flowtable namespace, since
spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
- Remove chain of multiple flowtables. We simply have one for
IPv4 and one for IPv6.
- Flowtables are allocated in flowtable.c, symbols are static.
- With proper argument to SYSINIT() we no longer need flowtable_ready.
- Hash salt doesn't need to be per-VNET.
- Removed rudimentary debugging, which use quite useless in dtrace era.

The runtime behavior of flowtable shouldn't be changed by this commit.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 260508 10-Jan-2014 melifaro

Simplify inet alias handling code: if we're adding/removing alias which
has the same prefix as some other alias on the same interface, use
newly-added rt_addrmsg() instead of hand-rolled in_addralias_rtmsg().

This eliminates the following rtsock messages:

Pinned RTM_ADD for prefix (for alias addition).
Pinned RTM_DELETE for prefix (for alias withdrawal).

Example (got 10.0.0.1/24 on vlan4, playing with 10.0.0.2/24):

before commit, addition:

got message of size 116 on Fri Jan 10 14:13:15 2014
RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

got message of size 192 on Fri Jan 10 14:13:15 2014
RTM_ADD: Add Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 10.0.0.2 (255) ffff ffff ff

after commit, addition:

got message of size 116 on Fri Jan 10 13:56:26 2014
RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 14.0.0.2 14.0.0.255

before commit, wihdrawal:

got message of size 192 on Fri Jan 10 13:58:59 2014
RTM_DELETE: Delete Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 10.0.0.2 (255) ffff ffff ff

got message of size 116 on Fri Jan 10 13:58:59 2014
RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

adter commit, withdrawal:

got message of size 116 on Fri Jan 10 14:14:11 2014
RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

Sending both RTM_ADD/RTM_DELETE messages to rtsock is completely wrong
(and requires some hacks to keep prefix in route table on RTM_DELETE).

I've tested this change with quagga (no change) and bird (*).

bird alias handling is already broken in *BSD sysdep code, so nothing
changes here, too.

I'm going to MFC this change if there will be no complains about behavior
change.

While here, fix some style(9) bugs introduced by r260488
(pointed by glebius and bde).

Sponsored by: Yandex LLC
MFC after: 4 weeks


# 260488 09-Jan-2014 melifaro

Split rt_newaddrmsg_fib() into two different functions.
Adding/deleting interface addresses involves access to 3 different subsystems,
int different parts of code. Each call can fail, so reporting successful
operation by rtsock in the middle of the process error-prone.

Further split routing notification API and actual rtsock calls via creating
public-available rt_addrmsg() / rt_routemsg() functions with "private"
rtsock_* backend.

MFC after: 2 weeks


# 260460 08-Jan-2014 melifaro

Constanly use RT_ALL_FIBS everywhere instead of -1.

MFC after: 2 weeks


# 260379 06-Jan-2014 melifaro

Partially fix IPv4 interface routes deletion in RADIX_MPATH.

Noticed by: Nikolay Denev <ndenev at gmail.com>
MFC after: 1 month


# 260295 04-Jan-2014 melifaro

Change semantics for rnh_lookup() function: now
it performs exact match search, regardless of netmask existance.
This simplifies most of rnh_lookup() consumers.

Fix panic triggered by deleting non-existent host route.

PR: kern/185092
Submitted by: Nikolay Denev <ndenev at gmail.com>
MFC after: 1 month


# 258591 25-Nov-2013 rodrigc

In vnet_route_uninit(), free some memory that is allocated in vnet_route_init().

To reproduce the problem:
(1) Take a GENERIC kernel config, and add options for: VIMAGE, WITNESS,
INVARIANTS.
(2) Run this command in a loop:
jail -l -u root -c path=/ name=foo persist vnet && jexec foo ifconfig lo0 127.0.0.1/8 && jail -r foo

see: http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021280.html
http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021291.html

This doesn't eliminate all the "Freed UMA keg was not empty" warning messages
on the console, but it helps.


# 257176 26-Oct-2013 glebius

The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 256624 16-Oct-2013 melifaro

Fix long-standing issue with incorrect radix mask calculation.

Usual symptoms are messages like
rn_delete: inconsistent annotation
rn_addmask: mask impossibly already in tree
or inability to flush/delete particular prefix in ipfw table.

Changes:
* Assume 32 bytes as maximum radix key length
* Remove rn_init()
* Statically allocate rn_ones/rn_zeroes
* Make separate mask tree for each "normal" tree instead of system global one
* Remove "optimization" on masks reusage and key zeroying
* Change rn_addmask() arguments to accept tree pointer (no users in base)

PR: kern/182851, kern/169206, kern/135476, kern/134531
Found by: Slawa Olhovchenkov <slw@zxy.spb.ru>
MFC after: 2 weeks
Reviewed by: glebius
Sponsored by: Yandex LLC


# 250764 18-May-2013 melifaro

Fix rte leak introduced in r248070.

MFC after: 2 weeks


# 250700 16-May-2013 julian

Finally change the mbuf to have its own fib field instead of stealing
4 flag bits. This was supposed to happen in 8.0, and again in 2012..

MFC after: never


# 248070 08-Mar-2013 melifaro

Fix long-standing issue with interface routes being unprotected:
Use RTM_PINNED flag to mark route as immutable.
Forbid deleting immutable routes without special rtrequest1_fib() flag.
Adding interface address with prefix already in route table is handled
by atomically deleting old prefix and adding interface one.

Discussed with: andre, eri
MFC after: 3 weeks


# 247842 05-Mar-2013 melifaro

Write lock is not required for find&compare operation.

MFC after: 2 weeks


# 233113 18-Mar-2012 bz

Hide kernel option ROUTETABLES evaluations in the implementation
rather than the header file. With this also move RT_MAXFIBS and
RT_NUMFIBS into the implemantion to avoid further usage in other
code. rt_numfibs is all that should be needed.

This allows users to change the number of FIBs from 1..RT_MAXFIBS(16)
dynamically using the tunable without the need to change the kernel
config for the maximum anymore. This means that thet multi-FIB
feature is now fully available with GENERIC kernels.
The kernel option ROUTETABLES can still be used to set the default
numbers of FIBs in absence of the tunable.

Ok.ed by: julian, hrs, melifaro
MFC after: 2 weeks


# 231852 17-Feb-2012 bz

Merge multi-FIB IPv6 support from projects/multi-fibv6/head/:

Extend the so far IPv4-only support for multiple routing tables (FIBs)
introduced in r178888 to IPv6 providing feature parity.

This includes an extended rtalloc(9) KPI for IPv6, the necessary
adjustments to the network stack, and user land support as in netstat.

Sponsored by: Cisco Systems, Inc.
Reviewed by: melifaro (basically)
MFC after: 10 days


# 230510 24-Jan-2012 bz

Replace random ARIN direct assignment legacy IPs with proper RFC 5735
TEST-NET1 block for use in documentation and example code addresses.

MFC after: 3 days


# 228532 15-Dec-2011 glebius

Simplify rtrequest(RTM_ADD): ifa can't be NULL after rt_getifa_fib().


# 226710 24-Oct-2011 qingli

The host-id/interface-id can have a specific value and is properly
masked out when adding a prefix route through the "route" command.
However, when deleting the route, simply changing the command keyword
from "add" to "delete" does not work. The failoure is observed in
both IPv4 and IPv6 route insertion. The patch makes the route command
behavior consistent between the "add" and the "delete" operation.

MFC after: 1 week


# 225837 28-Sep-2011 bz

Pass the fibnum where we need filtering of the message on the
rtsock allowing routing daemons to filter routing updates on an
rtsock per FIB.

Adjust raw_input() and split it into wrapper and a new function
taking an optional callback argument even though we only have one
consumer [1] to keep the hackish flags local to rtsock.c.

PR: kern/134931
Submitted by: multiple (see PR)
Suggested by: rwatson [1]
Reviewed by: rwatson
MFC after: 3 days


# 225617 16-Sep-2011 kmacy

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 224703 08-Aug-2011 kevlo

In rtinit1(), before rtrequest1_fib() is called, info.rti_flags is
initialized by flags (function argument) or-ed with ifa->ifa_flags.
If both NIC has a loopback route to itself, so IFA_RTSELF is set on ifa(s).
As IFA_RTSELF is defined by RTF_HOST, rtrequest1_fib() is called with
RTF_HOST flag even if netmask is not NULL. Consequently, netmask is set
to zero in rtrequest1_fib(), and request to add network route is changed
under hands to request to add host route.

Tested by: Andrew Boyer <aboyer at averesystems.com>
Submitted by: Svatopluk Kraus <onwahe at gmail dot com>
Approved by: re (hrs)


# 223359 21-Jun-2011 bz

Garbage collect never used global, sysctl, externs.

MFC after: 1 week


# 223334 20-Jun-2011 bz

Leave an extra comment about flowtable and IPv6 support rectifying a
previous comment.

MFC after: 1 week


# 219786 19-Mar-2011 dchagin

ouch, newrt is used on the return path, my fault.
Partialy revert the previous change.

MFC after: 1 Week.


# 219783 19-Mar-2011 dchagin

A bit rearranged rtalloc1_fib() code.
Initialize a variable when it is really needed.
To avoid code duplication move the miss label to line up and jump on it.

MFC after: 1 Week


# 219776 19-Mar-2011 dchagin

Remove a now unused variable.

MFC after: 1 Week


# 218909 21-Feb-2011 brucec

Fix typos - remove duplicate "the".

PR: bin/154928
Submitted by: Eitan Adler <lists at eitanadler.com>
MFC after: 3 days


# 217322 12-Jan-2011 mdf

sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.

Commit the net* piece.


# 215701 22-Nov-2010 dim

After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.


# 215317 14-Nov-2010 dim

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.


# 208553 25-May-2010 qingli

This patch fixes the problem where proxy ARP entries cannot be added
over the if_ng interface.

MFC after: 3 days


# 207369 29-Apr-2010 bz

MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days


# 204902 08-Mar-2010 qingli

One of the advantages of enabling ECMP (a.k.a RADIX_MPATH) is to
allow for connection load balancing across interfaces. Currently
the address alias handling method is colliding with the ECMP code.
For example, when two interfaces are configured on the same prefix,
only one prefix route is installed. So connection load balancing
among the available interfaces is not possible.

The other advantage of ECMP is for failover. The issue with the
current code, is that the interface link-state is not reflected
in the route entry. For example, if there are two interfaces on
the same prefix, the cable on one interface is unplugged, new and
existing connections should switch over to the other interface.
This is not done today and packets go into a black hole.

Also, there is a small bug in the kernel where deleting ECMP routes
in the userland will always return an error even though the command
is successfully executed.

MFC after: 5 days


# 201282 30-Dec-2009 qingli

The proxy arp entries could not be added into the system over the
IFF_POINTOPOINT link types. The reason was due to the routing
entry returned from the kernel covering the remote end is of an
interface type that does not support ARP. This patch fixes this
problem by providing a hint to the kernel routing code, which
indicates the prefix route instead of the PPP host route should
be returned to the caller. Since a host route to the local end
point is also added into the routing table, and there could be
multiple such instantiations due to multiple PPP links can be
created with the same local end IP address, this patch also fixes
the loopback route installation failure problem observed prior to
this patch. The reference count of loopback route to local end would
be either incremented or decremented. The first instantiation would
create the entry and the last removal would delete the route entry.

MFC after: 5 days


# 200537 14-Dec-2009 luigi

Move the scan for max_keylen into route.c::route_init(),
and make max_keylen an argument for rn_init().
This removes an unnecessary dependency on domain.h from radix.c

MFC after: 7 days


# 199365 17-Nov-2009 tuexen

Fix a LOR showing up with sctp_bsd_addr(): Do not hold a rt lock
when calling rt_newaddrmsg().

Reviewed by: qingli
Approved by: rrs (mentor)
MFC after: 1 month


# 197727 03-Oct-2009 bz

Put #ifdef INET around parts of the FLOWTABLE code, to unbreak
nooptions INET kernel builds.

MFC after: 3 days
X-MFC: with r197687


# 197687 01-Oct-2009 qingli

The flow-table associates TCP/UDP flows and IP destinations with
specific routes. When the routing table changes, for example,
when a new route with a more specific prefix is inserted into the
routing table, the flow-table is not updated to reflect that change.
As such existing connections cannot take advantage of the new path.
In some cases the path is broken. This patch will update the affected
flow-table entries when a more specific route is added. The route
entry is properly marked when a route is deleted from the table.
In this case, when the flow-table performs a search, the stale
entry is updated automatically. Therefore this patch is not
necessary for route deletion.

Submitted by: simon, phk
Reviewed by: bz, kmacy
MFC after: 3 days


# 196019 01-Aug-2009 rwatson

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# 195837 23-Jul-2009 rwatson

Introduce and use a sysinit-based initialization scheme for virtual
network stacks, VNET_SYSINIT:

- Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will
occur each time a network stack is instantiated and destroyed. In the
!VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT.
For the VIMAGE case, we instead use SYSINIT's to track their order and
properties on registration, using them for each vnet when created/
destroyed, or immediately on module load for already-started vnets.
- Remove vnet_modinfo mechanism that existed to serve this purpose
previously, as well as its dependency scheme: we now just use the
SYSINIT ordering scheme.
- Implement VNET_DOMAIN_SET() to allow protocol domains to declare that
they want init functions to be called for each virtual network stack
rather than just once at boot, compiling down to DOMAIN_SET() in the
non-VIMAGE case.
- Walk all virtualized kernel subsystems and make use of these instead
of modinfo or DOMAIN_SET() for init/uninit events. In some cases,
convert modular components from using modevent to using sysinit (where
appropriate). In some cases, do minor rejuggling of SYSINIT ordering
to make room for or better manage events.

Portions submitted by: jhb (VNET_SYSINIT), bz (cleanup)
Discussed with: jhb, bz, julian, zec
Reviewed by: bz
Approved by: re (VIMAGE blanket)


# 195727 16-Jul-2009 rwatson

Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used. Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with: bz, julian
Reviewed by: bz
Approved by: re (kensmith, kib)


# 195699 14-Jul-2009 rwatson

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 195624 11-Jul-2009 kmacy

Re-factoring for adding weighted routes introduced a
fairly irritating bug where the system will panic
when RADIX_MPATH is enabled. This change fixes this.

Approved by: re@


# 194760 23-Jun-2009 rwatson

Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references. The following routines now return references:

ifaddr_byindex
ifa_ifwithaddr
ifa_ifwithbroadaddr
ifa_ifwithdstaddr
ifa_ifwithnet
ifaof_ifpforaddr
ifa_ifwithroute
ifa_ifwithroute_fib
rt_getifa
rt_getifa_fib
IFP_TO_IA
ip_rtaddr
in6_ifawithifp
in6ifa_ifpforlinklocal
in6ifa_ifpwithaddr
in6_ifadd
carp_iamatch6
ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc). In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit. Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by: bz
Obtained from: Apple, Inc. (portions)
MFC after: 6 weeks (portions)


# 194640 22-Jun-2009 bz

Move virtualization of routing related variables into their own
Vimage module, which had been there already but now is stateful.

All variables are now file local; so this further limits the global
spreading of routing related things throughout the kernel.

Add a missing function local variable in case of MPATHing.

Reviewed by: zec


# 194629 22-Jun-2009 bz

Collect all VIMAGE_GLOBALS variables in one place.

No longer export rt_tables as all lookups go through
rt_tables_get_rnh().

We cannot make rt_tables (and rtstat, rttrash[1]) static as
netstat -r (-rs[1]) would stop working on a stripped
VIMAGE_GLOBALS kernel.

Reviewed by: zec
Presumably broken by: phk 13.5y ago in r12820 [1]


# 194622 22-Jun-2009 rwatson

Add a new function, ifa_ifwithaddr_check(), which rather than returning
a pointer to an ifaddr matching the passed socket address, returns a
boolean indicating whether one was present. In the (near) future,
ifa_ifwithaddr() will return a referenced ifaddr rather than a raw
ifaddr pointer, and the new wrapper will allow callers that care only
about the boolean condition to avoid having to free that reference.

MFC after: 3 weeks


# 194602 21-Jun-2009 rwatson

Clean up common ifaddr management:

- Unify reference count and lock initialization in a single function,
ifa_init().
- Move tear-down from a macro (IFAFREE) to a function ifa_free().
- Move reference count bump from a macro (IFAREF) to a function ifa_ref().
- Instead of using a u_int protected by a mutex to refcount(9) for
reference count management.

The ifa_mtx is now used for exactly one ioctl, and possibly should be
removed.

MFC after: 3 weeks


# 193731 08-Jun-2009 zec

Introduce an infrastructure for dismantling vnet instances.

Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)


# 193232 01-Jun-2009 bz

Convert the two dimensional array to be malloced and introduce
an accessor function to get the correct rnh pointer back.

Update netstat to get the correct pointer using kvm_read()
as well.

This not only fixes the ABI problem depending on the kernel
option but also permits the tunable to overwrite the kernel
option at boot time up to MAXFIBS, enlarging the number of
FIBs without having to recompile. So people could just use
GENERIC now.

Reviewed by: julian, rwatson, zec
X-MFC: not possible


# 191734 02-May-2009 zec

Unbreak options VIMAGE + nooptions INVARIANTS kernel builds.

Submitted by: julian
Approved by: julian (mentor)


# 191548 26-Apr-2009 zec

In preparation for turning on options VIMAGE in next commits,
rearrange / replace / adjust several INIT_VNET_* initializer
macros, all of which currently resolve to whitespace.

Reviewed by: bz (an older version of the patch)
Approved by: julian (mentor)


# 191080 14-Apr-2009 kmacy

Extend route command:
- add show as alias for get
- add weights to allow mpath to do more than equal cost
- add sticky / nostick to disable / re-enable per-connection load balancing

This adds a field to rt_metrics_lite so network bits of world will need to be re-built.

Reviewed by: jeli & qingli


# 190909 11-Apr-2009 zec

Introduce vnet module registration / initialization framework with
dependency tracking and ordering enforcement.

With this change, per-vnet initialization functions introduced with
r190787 are no longer directly called from traditional initialization
functions (which cc in most cases inlined to pre-r190787 code), but are
instead registered via the vnet framework first, and are invoked only
after all prerequisite modules have been initialized. In the long run,
this framework should allow us to both initialize and dismantle
multiple vnet instances in a correct order.

The problem this change aims to solve is how to replay the
initialization sequence of various network stack components, which
have been traditionally triggered via different mechanisms (SYSINIT,
protosw). Note that this initialization sequence was and still can be
subtly different depending on whether certain pieces of code have been
statically compiled into the kernel, loaded as modules by boot
loader, or kldloaded at run time.

The approach is simple - we record the initialization sequence
established by the traditional mechanisms whenever vnet_mod_register()
is called for a particular vnet module. The vnet_mod_register_multi()
variant allows a single initializer function to be registered multiple
times but with different arguments - currently this is only used in
kern/uipc_domain.c by net_add_domain() with different struct domain *
as arguments, which allows for protosw-registered initialization
routines to be invoked in a correct order by the new vnet
initialization framework.

For the purpose of identifying vnet modules, each vnet module has to
have a unique ID, which is statically assigned in sys/vimage.h.
Dynamic assignment of vnet module IDs is not supported yet.

A vnet module may specify a single prerequisite module at registration
time by filling in the vmi_dependson field of its vnet_modinfo struct
with the ID of the module it depends on. Unless specified otherwise,
all vnet modules depend on VNET_MOD_NET (container for ifnet list head,
rt_tables etc.), which thus has to and will always be initialized
first. The framework will panic if it detects any unresolved
dependencies before completing system initialization. Detection of
unresolved dependencies for vnet modules registered after boot
(kldloaded modules) is not provided.

Note that the fact that each module can specify only a single
prerequisite may become problematic in the long run. In particular,
INET6 depends on INET being already instantiated, due to TCP / UDP
structures residing in INET container. IPSEC also depends on INET,
which will in turn additionally complicate making INET6-only kernel
configs a reality.

The entire registration framework can be compiled out by turning on the
VIMAGE_GLOBALS kernel config option.

Reviewed by: bz
Approved by: julian (mentor)


# 190787 06-Apr-2009 zec

First pass at separating per-vnet initializer functions
from existing functions for initializing global state.

At this stage, the new per-vnet initializer functions are
directly called from the existing global initialization code,
which should in most cases result in compiler inlining those
new functions, hence yielding a near-zero functional change.

Modify the existing initializer functions which are invoked via
protosw, like ip_init() et. al., to allow them to be invoked
multiple times, i.e. per each vnet. Global state, if any,
is initialized only if such functions are called within the
context of vnet0, which will be determined via the
IS_DEFAULT_VNET(curvnet) check (currently always true).

While here, V_irtualize a few remaining global UMA zones
used by net/netinet/netipsec networking code. While it is
not yet clear to me or anybody else whether this is the right
thing to do, at this stage this makes the code more readable,
and makes it easier to track uncollected UMA-zone-backed
objects on vnet removal. In the long run, it's quite possible
that some form of shared use of UMA zone pools among multiple
vnets should be considered.

Bump __FreeBSD_version due to changes in layout of structs
vnet_ipfw, vnet_inet and vnet_net.

Approved by: julian (mentor)


# 186705 02-Jan-2009 qingli

The log message should terminate with a newline instead
of a tab character.


# 186167 16-Dec-2008 kmacy

style and spelling fix


# 186119 15-Dec-2008 qingli

This main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,

The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
me maintaining that branch before the svn conversion


# 185849 10-Dec-2008 kmacy

fix a reported panic when adding a route and one hit here when deleting a route

- pass RTF_RNH_LOCKED to rtalloc1_fib in 2 cases where the lock is held
- make sure the rnh lock is held across rt_setgate and rt_getifa_fib


# 185807 09-Dec-2008 bz

Fix a bug introduced in r185747: rather than dereferencing an uninitialized
*rt to something undefined, use the fibnum that came in as function argument.

Found with: Coverity Prevent(tm)
CID: 4168


# 185774 08-Dec-2008 kmacy

- avoid recursively locking the radix node head lock
- assert that it is held if RTF_RNH_LOCKED is not passed


# 185747 07-Dec-2008 kmacy

- convert radix node head lock from mutex to rwlock
- make radix node head lock not recursive
- fix LOR in rtexpunge
- fix LOR in rtredirect

Reviewed by: sam


# 185571 02-Dec-2008 bz

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 185348 26-Nov-2008 zec

Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 185088 19-Nov-2008 zec

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 183550 02-Oct-2008 zec

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 183200 20-Sep-2008 zec

Move #defines for MRT-related constants from net/route.c to
net/route.h, because the vnet code will need those constants as
well.

Reviewed by: bz
Approved by: julian (mentor)
MFC after: never


# 183034 15-Sep-2008 julian

Hey, committed the same typo twice! must be a record


# 183032 15-Sep-2008 julian

rewrite rt_check. Ztake into account that whiel teh rtentry is unlocked,
someone else might change it, so after we re-acquire the lock on it,
we need to check it is still valid. People have been panicing in this
function due to soem edge cases which I have hopefully removed.

Reviewed by: keramida @
Obtained from: 1 week


# 183017 14-Sep-2008 julian

come on Julian, make up if you're committing one change or the other.
fix braino


# 183013 14-Sep-2008 julian

Revert a part of the MRT commit that proved un-needed.
rt_check() in its original form proved to be sufficient and
rt_check_fib() can go away (as can its evil twin in_rt_check()).

I believe this does NOT address the crashes people have been seeing
in rt_check.

MFC after: 1 week


# 182615 01-Sep-2008 brooks

Wrap a line that became too long with the addition of V_.

(This file contains many more unwrapped or badly wrapped lines.)


# 181803 17-Aug-2008 bz

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# 180840 26-Jul-2008 julian

Add the ability to add new addresses for interfacesto just one FIB
(Other more specific related options will follow)
This allows one to set multiple p2p links to the same place
and select which to use by having each in different FIBS.


# 178898 10-May-2008 julian

move a #define from a place it shouldn't have been to a place it should
have been. Basically my testign didn't ocver one case that this broke.
thanks tinderbox!


# 178897 10-May-2008 julian

undef MAXFIBS before redefining it


# 178888 09-May-2008 julian

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# 178176 13-Apr-2008 bz

Fix the build in case RADIX_MPATH is not defined.


# 178167 13-Apr-2008 qingli

This patch provides the back end support for equal-cost multi-path
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
is disallowed. For example,

route add -net 192.103.54.0/24 10.9.44.1
route add -net 192.103.54.0/24 10.9.44.2

The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"

Multiple default routes can also be inserted. Here is the netstat
output:

default 10.2.5.1 UGS 0 3074 bge0 =>
default 10.2.5.2 UGS 0 0 bge0

When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,

route delete default

would fail and trigger the following error message:

"route: writing to routing socket: No such process"
"delete net default: not in table"

On the other hand,

route delete default 10.2.5.2

would be successful: "delete net default: gateway 10.2.5.2"

One does not have to specify a gateway if there is only a single
route for a particular destination.

I need to perform more testings on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options RADIX_MPATH" in the kernel configuration
to enable this feature.

Reviewed by: robert, sam, gnn, julian, kmacy


# 176244 13-Feb-2008 jhb

Use RTFREE_LOCKED() instead of rtfree() when releasing a reference on the
'rt' route in rtredirect() as 'rt' is always locked.

MFC after: 1 week
PR: kern/117913
Submitted by: Stefan Lambrev stefan.lambrev of moneybookers.com


# 174934 27-Dec-2007 mux

Add a workaround for a deadlock between the rt_setgate() and rt_check()
functions. It is easily triggered by running routed, and, I expect, by
running any other daemon that uses routing sockets.

Reviewed by: net@
MFC after: 1 week


# 174703 17-Dec-2007 kmacy

widen the routing event interface (arp update, redirect, and eventually pmtu change)
into separate functions

revert previous commit's changes to arpresolve and add a new interface
arpresolve2 which does arp resolution without an mbuf


# 174559 12-Dec-2007 kmacy

add interface for allowing consumers to register for ARP updates,
redirects, and path MTU changes

Reviewed by: silby


# 174374 06-Dec-2007 julian

No need to assert that a == b when we just set a = b.


# 172885 22-Oct-2007 jhb

Close a race when trying to lookup a gateway route in rt_check().
Specifically, if two threads were doing concurrent lookups and the existing
gateway was marked down, the the first thread would drop a reference on the
gateway route and then unlock the "root" route while it tried to allocate
a new route. The second thread could then also drop a reference on the
same gateway route resulting in a reference underflow. Fix this by
clearing the gateway route pointer after dropping the reference count but
before dropping the lock. Secondly, in this same case, the second thread
would overwrite the gateway route pointer w/o free'ing a reference to the
route installed by the first thread. In practice this would probably just
fix a lost reference that would result in a route never being freed.

This fixes panics observed in rt_check() and rtexpunge().

MFC after: 1 week
PR: kern/112490
Insight from: mehuljv at yahoo.com
Reviewed by: ru (found the "not-setting it to NULL" part)
Tested by: several


# 170557 11-Jun-2007 phk

Add missing \n to printf


# 169872 22-May-2007 glebius

Some minor cleanups:
- In rt_check() remove the senderr() macro and the "bad" label. They
used to simplify code, but now aren't.
- Remove extra RT_LOCK_ASSERT() in rt_setgate(). The RT_REMREF macro
does this.
- In rtfree() convert panics to KASSERTs.
- Strict the routing API: rtfree() should be called only in a case
when we are completely sure we've got the last reference on the
rtentry. In all other cases RTFREE_LOCKED() macro should be used.
If the reference isn't the last one spit out a warning printf.
Correct the only(?) case for this in rt_check().
- Fix typos in comments.


# 164549 23-Nov-2006 bde

Initialize a local variable in 2 places just before it is used, not always
at the start of rtalloc1(). This backs out part of revs 1.83 and 1.85.

Profiling on an i386 showed that that for sending tiny packets using
bge, -current takes 7 bzero()s where RELENG_4 takes only 1, and that
bzero()ing is now the dominant overhead (10-12%, up from 1%, but
profiling overestimated this a bit). This commit backs out 2 of the
6 extra bzero()s (1 in each of 2 calls per packet to rtalloc1()). They
were the largest ones by byte count (48 bytes each) but perhaps not
by time (small misaligned ones might take longer).


# 159305 05-Jun-2006 qingli

Assuming the interface has an address of x.x.x.195, a mask of
255.255.255.0, and a default route with gateway x.x.x.1. Now if
the address mask is changed to something more specific, e.g.,
255.255.255.128, then after the mask change the default gateway
is no longer reachable.

Since the default route is still present in the routing table,
when the output code tries to resolve the address of the default
gateway in function rt_check(), again, the default route will be
returned by rtalloc1(). Because the lock is currently held on the
rtentry structure, one more attempt to hold the lock will trigger
a crash due to "lock recursed on non-recursive mutex ..."

This is a general problem. The fix checks for the above condition
so that an existing route entry is not mistaken for a new cloned
route. Approriately, an ENETUNREACH error is returned back to the
caller

Approved by: andre


# 158661 16-May-2006 qingli

The current routing code allows insertion of indirect routes that have
gateways which are unreachable except through the default router. For
example, assuming there is a default route configured, and inserting
a route

"route add 64.102.54.0/24 60.80.1.1"

is currently allowed even when 60.80.1.1 is only reachable through
the default route. However, an error is thrown when this route is
utilized, say,

"ping 64.102.54.1" will return an error

This type of route insertion should be disallowed becasue:

1) Let's say that somehow our code allowed this packet to flow to
the default router, and the default router knows the next hop is
60.80.1.1, then the question is why bother inserting this route in
the 1st place, just simply use the default route.

2) Since we're not talking about source routing here, the default
router could very well choose a different path than using 60.80.1.1
for the next hop, again it defeats the purpose of adding this route.

Reviewed by: ru, gnn, bz
Approved by: andre


# 158294 04-May-2006 bz

In rtrequest and rtinit check for sa_len != 0 for the given
destination. These checks are needed so we do not install
a route looking like this:
(0) 192.0.2.200 UH tun0 =>

When removing this route the kernel will start to walk
the address space which looks like a hang on 64bit platforms
because it'll take ages while on 32bit you should see a panic
when kernel debugging options are turned on.

The problem is in rtrequest1:
if (netmask) {
rt_maskedcopy(dst, ndst, netmask);
} else
bcopy(dst, ndst, dst->sa_len);

In both cases the len might be 0 if the application forgot to
set it. If so ndst will be all-zero leading to above
mentioned strange routes.

This is an application error but we must not fail/hang/panic
because of this.

Looks ok: gnn
No objections: net@ (silence)
MFC after: 8 weeks


# 152315 11-Nov-2005 ru

- Store pointer to the link-level address right in "struct ifnet"
rather than in ifindex_table[]; all (except one) accesses are
through ifp anyway. IF_LLADDR() works faster, and all (except
one) ifaddr_byindex() users were converted to use ifp->if_addr.

- Stop storing a (pointer to) Ethernet address in "struct arpcom",
and drop the IFP2ENADDR() macro; all users have been converted
to use IF_LLADDR() instead.


# 150414 21-Sep-2005 glebius

Several fixes to rt_setgate(), that fix problems with route changing:

- Rearrange code so that in a case of failure the affected
route is not changed. Otherwise, a bogus rtentry will be
left and later rt_check() can recurse on its lock. [1]
- Remove comment about protocol cloning.
- Fix two places where rtentry mutex was recursed on, because
accessed via two different pointers, that were actually pointing
to the same rtentry in some cases. [1]
- Return EADDRINUSE instead of bogus EDQUOT, in case when gateway
uses the same route. [2]

Reported & tested by: ps, Andrej Zverev <az inec.ru> [1]
PR: kern/64090 [2]


# 150351 19-Sep-2005 andre

Use monotonic 'time_uptime' instead of 'time_second' as timebase
for rt->rt_rmx.rmx_expire.


# 148954 11-Aug-2005 glebius

o Make rt_check() function more strict:
- rt0 passed to rt_check() must not be NULL, assert this.
- rt returned by rt_check() must be valid locked rtentry,
if no error occured.
o Modify callers, so that they never pass NULL rt0
to rt_check().

Reviewed by: sam, ume (nd6.c)


# 148883 09-Aug-2005 glebius

In preparation for fixing races in ARP (and probably in other
L2/L3 mappings) make rt_check() return a locked rtentry.


# 147650 28-Jun-2005 qingli

Require gateways for routes to be of the same address family as the
route itself.

It fixes a bug where an IPv4 route for example has an IPv6 gateway
specified:

route add 10.1.1.1 -inet6 fe80::1%fxp0

Destination Gateway Flags Refs Use Netif Expire
10.1.1.1 fe80::1%fxp0 UGHS 0 0 fxp0

The fix rejects these illegal combinations:

route: writing to routing socket: Invalid argument
add host 10.1.1.1: gateway fe80::1%fxp0: Invalid argument

Reviewed by: KAME jinmei@isl.rdc.toshiba.co.jp
Reviewed by: andre (mentor)
Approved by: re
MFC after: 5


# 139823 06-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 134122 21-Aug-2004 csjp

When a prison is given the ability to create raw sockets (when the
security.jail.allow_raw_sockets sysctl MIB is set to 1) where privileged
access to jails is given out, it is possible for prison root to manipulate
various network parameters which effect the host environment. This commit
plugs a number of security holes associated with the use of raw sockets
and prisons.

This commit makes the following changes:

- Add a comment to rtioctl warning developers that if they add
any ioctl commands, they should use super-user checks where necessary,
as it is possible for PRISON root to make it this far in execution.
- Add super-user checks for the execution of the SIOCGETVIFCNT
and SIOCGETSGCNT IP multicast ioctl commands.
- Add a super-user check to rip_ctloutput(). If the calling cred
is PRISON root, make sure the socket option name is IP_HDRINCL,
otherwise deny the request.

Although this patch corrects a number of security problems associated
with raw sockets and prisons, the warning in jail(8) should still
apply, and by default we should keep the default value of
security.jail.allow_raw_sockets MIB to 0 (or disabled) until
we are certain that we have tracked down all the problems.

Looking forward, we will probably want to eliminate the
references to curthread.

This may be a MFC candidate for RELENG_5.

Reviewed by: rwatson
Approved by: bmilekic (mentor)


# 133513 11-Aug-2004 andre

Convert the routing table to use an UMA zone for rtentries. The zone is
called "rtentry".

This saves a considerable amount of kernel memory. R_Zmalloc previously
used 256 byte blocks (plus kmalloc overhead) whereas UMA only needs 132
bytes.

Idea from: OpenBSD


# 132780 28-Jul-2004 kan

Avoid casts as lvalues.


# 128626 24-Apr-2004 luigi

fix one typo and remove one wrong line


# 128622 24-Apr-2004 luigi

Correct and extend the description of the behaviour of rt_check().


# 128524 21-Apr-2004 luigi

Clearly comment the assumptions that allow us to cast a
'struct radix_node *' to a 'struct rtentry *' in this code,
and introduce a macro, RNTORT(), to do this type conversion.


# 128455 20-Apr-2004 luigi

Fix the initial check for NULL arguments in rtfree (previously
it checked for rt == NULL after dereferencing the pointer).
We never check for those events elsewhere, so probably these checks
might go away here as well.

Slightly simplify (and document) the logic for memory allocation
in rt_setgate().

The rest is mostly style changes -- replace 0 with NULL where appropriate,
remove the macro SA() that was only used once, remove some useless
debugging code in rt_fixchange, explain some odd-looking casts.


# 128399 18-Apr-2004 luigi

replace Bcopy with bcopy as in the rest of the file.


# 128357 17-Apr-2004 luigi

make route_init() static


# 128311 16-Apr-2004 luigi

Consistently use ifaddr_byindex() to access the link-level address
of an interface. No functional change.

On passing, comment a likely bug in net/rtsock.c:sysctl_ifmalist()
which, if confirmed, would deserve to be fixed and MFC'ed


# 128185 13-Apr-2004 luigi

route.h: introduce a macro, SA_SIZE(struct sockaddr *) which returns
the space occupied by a struct sockaddr when passed through a
routing socket.
Use it to replace the macro ROUNDUP(int), that does the same but
is redefined by every file which uses it, courtesy of
the School of Cut'n'Paste Programming(TM).

(partial) userland changes to follow.


# 128167 12-Apr-2004 luigi

in rtinit(), remove one useless variable, and move a few others
within the block where they are used.


# 128019 07-Apr-2004 imp

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 124237 07-Jan-2004 sam

Remove extraneous unlock. This fixes a panic seen when manipulating static
entries in the ARP table.


# 123262 07-Dec-2003 sam

bandaid LOR in rt_setgate; a proper fix requires code refactoring


# 122986 25-Nov-2003 sam

workaround LOR in rt_setgate

Reviewed by: andre
Approved by: re (rwatson)


# 122921 20-Nov-2003 andre

Remove RTF_PRCLONING from routing table and adjust users of it
accordingly. The define is left intact for ABI compatibility
with userland.

This is a pre-step for the introduction of tcp_hostcache. The
network stack remains fully useable with this change.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# 122334 08-Nov-2003 sam

replace explicit changes to rt_refcnt by RT_ADDREF and RT_REMREF
macros that expand to include assertions when the system is built
with INVARIANTS

Supported by: FreeBSD Foundation


# 121770 30-Oct-2003 sam

Overhaul routing table entry cleanup by introducing a new rtexpunge
routine that takes a locked routing table reference and removes all
references to the entry in the various data structures. This
eliminates instances of recursive locking and also closes races
where the lock on the entry had to be dropped prior to calling
rtrequest(RTM_DELETE). This also cleans up confusion where the
caller held a reference to an entry that might have been reclaimed
(and in some cases used that reference).

Supported by: FreeBSD Foundation


# 121717 29-Oct-2003 sam

avoid recursive lock panic by unlocking before calling rtrequest;
this is consistent with other places but will be replaced
shortly by a "proper fix"

Supported by: FreeBSD Foundation
Pain felt by: Jiri Mikulas


# 121139 16-Oct-2003 sam

Correct handling of cloning loop avoidance: rtalloc1 may return a null
pointer in which case we should not do the unlock.

Supported by: FreeBSD Foundatin


# 120993 11-Oct-2003 sam

fix braino: null the pointer who's memory we just free'd, not some other
pointers that are (potentially) used later


# 120888 07-Oct-2003 sam

insure local variable is initialized prior to use


# 120820 05-Oct-2003 sam

fix typo that caused a panic when processing an ICMP redirect

Sponsored by: FreeBSD Foundation


# 120727 04-Oct-2003 sam

Locking for updates to routing table entries. Each rtentry gets a mutex
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.

Other/related changes:

o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts

Notes:

1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do-no know about mutex's. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.

Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)


# 120701 03-Oct-2003 sam

cleanups prior to adding locking (and in some cases to eliminate locking):

o move route_cb to be private to rtsock.c
o replace global static route_proto by locals
o eliminate global #define shorthands for info references
o remove some register decls
o ansi-fy function decls
o move items to be close in scope to their usage
o add rt_dispatch function for dispatching the actual message
o cleanup tangled logic for doing all-but-me msg send

Support by: FreeBSD Foundation


# 113428 13-Apr-2003 hsu

No need to unlock if error detected before locking.

Submitted by: harti


# 111767 02-Mar-2003 mdodd

Reduce code duplication. This adds the function rt_check() to route.c.

Approved by: sam (in principle)


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 108272 25-Dec-2002 ru

I'm not sure what was the problem at the time of revision 1.37
when julian@ added it, but the commented out code had at least
one bug -- not freeing the allocated mbuf.

Anyway, this comment no longer applies as of revision 1.67, so
remove it.


# 108270 25-Dec-2002 ru

Revision 1.67 changes correspond to CSRG revision 8.3.1.1 changes.


# 108269 25-Dec-2002 ru

If the caller of rtrequest*(RTM_DELETE, ...) asked for a copy of
the entry being removed (ret_nrt != NULL), increment the entry's
rt_refcnt like we do it for RTM_ADD and RTM_RESOLVE, rather than
messing around with 1->0 transitions for rtfree() all over.


# 108250 24-Dec-2002 hsu

SMP locking for radix nodes.


# 108206 23-Dec-2002 ru

rn_walktree*() compute the next leaf before applying a function
to current leaves because function may vanish the current node.

If parent RTA_GENMASK route has a clone (a "cloning clone"), an
rn_walktree_from() starting from parent will cause another walk
starting from clone. If a function is either rt_fixdelete() or
rt_fixchange(), this recursive walk may vanish the leaf that is
remembered by an outer walk (the "next leaf" above), panicing a
system when it resumes with an outer walk.

The following script paniced my single-user mode booted system:

: sysctl net.inet.ip.forwarding=1
: ipfw add 1 allow ip from any to any
: ifconfig lo0 127.1
: route add -net 10 -genmask 255.255.255.0 127.1
: telnet 10.1 # rt_fixchange() panic
: telnet 10.2
: telnet 10.1
: route delete -net 10 # rt_fixdelete() panic

For the time being, avoid these races by disallowing recursive
walks in rt_fixchange() and rt_fixdelete().

Also, make a slight optimization in the rtrequest(RTM_RESOLVE)
case: there is no reason to call rt_fixchange() in this case.

PR: kern/37606
MFC after: 5 days


# 108033 18-Dec-2002 hsu

Lock up ifaddr reference counts.


# 106968 15-Nov-2002 luigi

Massive cleanup of the ip_mroute code.

No functional changes, but:

+ the mrouting module now should behave the same as the compiled-in
version (it did not before, some of the rsvp code was not loaded
properly);
+ netinet/ip_mroute.c is now truly optional;
+ removed some redundant/unused code;
+ changed many instances of '0' to NULL and INADDR_ANY as appropriate;
+ removed several static variables to make the code more SMP-friendly;
+ fixed some minor bugs in the mrouting code (mostly, incorrect return
values from functions).

This commit is also a prerequisite to the addition of support for PIM,
which i would like to put in before DP2 (it does not change any of
the existing APIs, anyways).

Note, in the process we found out that some device drivers fail to
properly handle changes in IFF_ALLMULTI, leading to interesting
behaviour when a multicast router is started. This bug is not
corrected by this commit, and will be fixed with a separate commit.

Detailed changes:
--------------------
netinet/ip_mroute.c all the above.
conf/files make ip_mroute.c optional
net/route.c fix mrt_ioctl hook
netinet/ip_input.c fix ip_mforward hook, move rsvp_input() here
together with other rsvp code, and a couple
of indentation fixes.
netinet/ip_output.c fix ip_mforward and ip_mcast_src hooks
netinet/ip_var.h rsvp function hooks
netinet/raw_ip.c hooks for mrouting and rsvp functions, plus
interface cleanup.
netinet/ip_mroute.h remove an unused and optional field from a struct

Most of the code is from Pavlin Radoslavov and the XORP project

Reviewed by: sam
MFC after: 1 week


# 97649 31-May-2002 silby

Ensure that packet counts are always reset to 0 when
a route is cloned. Previously, they took on the count
of their parent route (which was sometimes nonzero.)

Submitted by: Andre Oppermann <oppermann@pipeline.ch>
MFC after: 5 days


# 92725 19-Mar-2002 alfred

Remove __P.


# 87060 28-Nov-2001 brian

Fix a typo in a comment


# 85074 17-Oct-2001 ru

Pull post-4.4BSD change to sys/net/route.c from BSD/OS 4.2.

Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *''
as the argument. Pass rt_addrinfo all the way down to rtrequest1
and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now
``rt_addrinfo *'' instead of ``sockaddr *'' (almost noone is
using it anyways).

Benefit: the following command now works. Previously we needed
two route(8) invocations, "add" then "change".
# route add -inet6 default ::1 -ifp gif0

Remove unsafe typecast in rtrequest(), from ``rtentry *'' to
``sockaddr *''. It was introduced by 4.3BSD-Reno and never
corrected.

Obtained from: BSD/OS, NetBSD
MFC after: 1 month
PR: kern/28360


# 85052 17-Oct-2001 ru

64-bit fixes from CSRG.


# 84971 15-Oct-2001 ru

Don't even attempt to clone host routes.

MFC after: 1 week


# 80353 25-Jul-2001 fenner

Don't bother passing p to rtioctl just so it can fail to pass it to mrt_ioctl


# 80350 25-Jul-2001 ume

As commented in defined in sys/net/route.c, rt_fixchange() has a bad
effect, which would cause unnecessary route deletion:

* Unfortunately, this has the obnoxious
* property of also triggering for insertion /above/ a pre-existing network
* route and clones. Sigh. This may be fixed some day.

The effect has been even worse, because recent versions of route.c set
the parent rtentry for cloned routes from an interface-direct route.
For example, suppose that we have an interface "ne0" that has an IPv4
subnet "10.0.0.0/24". Then we may have a cloned route like 10.0.0.1
on the interface, whose parent route is 10.0.0.0/24 (to the interface
ne0). Now, when we add the default route (i.e. 0.0.0.0/0),
rt_fixchange() will remove the cloned route 10.0.0.1. The (bad) effect
also prevents rt_setgate from configuring rt_gwroute, which would not
be an intended behavior.

As suggested in the comments to rt_fixchange(), we need stricter check
in the function, to prevent unintentional route deletion.

This fix also solve the "IPV6 panic?" problem in nd6_timer().

Submitted by: JINMEI Tatuya <jinmei@isl.rdc.toshiba.co.jp>
MFC after: 4 days


# 77689 04-Jun-2001 ru

When looking for an interface appropriate for the (new or changing)
route in ifa_ifwithroute(), as the last resort, look up the route to
the gateway, not destination (to derive the interface from).

PR: kern/27852
Submitted by: Iasen Kostoff <tbyte@tbyte.org>
MFC after: 2 weeks


# 74299 15-Mar-2001 ru

net/route.c:

A route generated from an RTF_CLONING route had the RTF_WASCLONED flag
set but did not have a reference to the parent route, as documented in
the rtentry(9) manpage. This prevented such routes from being deleted
when their parent route is deleted.

Now, for example, if you delete an IP address from a network interface,
all ARP entries that were cloned from this interface route are flushed.

This also has an impact on netstat(1) output. Previously, dynamically
created ARP cache entries (RTF_STATIC flag is unset) were displayed as
part of the routing table display (-r). Now, they are only printed if
the -a option is given.

netinet/in.c, netinet/in_rmx.c:

When address is removed from an interface, also delete all routes that
point to this interface and address. Previously, for example, if you
changed the address on an interface, outgoing IP datagrams might still
use the old address. The only solution was to delete and re-add some
routes. (The problem is easily observed with the route(8) command.)

Note, that if the socket was already bound to the local address before
this address is removed, new datagrams generated from this socket will
still be sent from the old address.

PR: kern/20785, kern/21914
Reviewed by: wollman (the idea)


# 59529 23-Apr-2000 wollman

A couple months ago, Kirk and I were doing a walkthrough of the radix-tree
search routine, and scratching our heads over why it was so obfuscated.
This delta fixes a number of confusing style bugs and renames several
structure members to have more meaningful names. There remain a number
of odd control-flow structures. These changes do not affect the generated
code.


# 56030 15-Jan-2000 shin

Clear ro->ro_rt just after RTFREE().
Pleases let me make sure that no one touch the invalid ro_rt pointer,
after splx(s) and before next ro_rt initialization.
Though usually this seems to be already called at splnet,
I still sometime experience kernel crash at rtfree() in my
INET6 enabled environment where IPv6 connection is frequently used.
(Off-course, it might be just due to another bug.)


# 55009 22-Dec-1999 shin

IPSEC support in the kernel.
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 54369 09-Dec-1999 jdp

Fix a route table leak in rtalloc() and rtalloc_ign(). It is
possible for ro->ro_rt to be non-NULL even though the RTF_UP flag
is cleared. (Example: a routing daemon or the "route" command
deletes a cloned route in active use by a TCP connection.) In that
case, the code was clobbering a reference to the routing table
entry without decrementing the entry's reference count.

The splnet() call probably isn't needed, but I haven't been able
to prove that yet. It isn't significant from a performance standpoint
since it is executed very rarely.

Reviewed by: wollman and others in the freebsd-current mailing list


# 54350 09-Dec-1999 shin

rtcalloc() is removed because it turned out not to be necessary for FreeBSD.
(It was added as a part of KAME patch)

Specified by: jdp@polstra.com


# 53647 23-Nov-1999 brian

Only emit the ``wrong ifa'' message if the matching interface
is neither IFF_LOOPBACK or IFF_POINTOPOINT. It's quite common
(and probably more correct) to route local IP numbers via lo0
and it makes configuration easier to assign the hostname address
to local POINTOPOINT links too.

This message usually remains hidden because the loopback interface
gets the highest interface number at boot time, but when the
ethernet interface is added later, the message can get pretty
annoying.

Also, fix a typo.

Not objected to by: freebsd-net


# 53541 22-Nov-1999 shin

KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 46161 29-Apr-1999 luoqi

Postpone route_init() until all domains are attached.


# 43305 27-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 35256 17-Apr-1998 des

Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.


# 33181 09-Feb-1998 eivind

Staticize.


# 33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


# 33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


# 32350 08-Jan-1998 eivind

Make INET a proper option.

This will not make any of object files that LINT create change; there
might be differences with INET disabled, but hardly anything compiled
before without INET anyway. Now the 'obvious' things will give a
proper error if compiled without inet - ipx_ip, ipfw, tcp_debug. The
only thing that _should_ work (but can't be made to compile reasonably
easily) is sppp :-(

This commit move struct arpcom from <netinet/if_ether.h> to
<net/if_arp.h>.


# 30813 28-Oct-1997 bde

Removed unused #includes.


# 29506 16-Sep-1997 bde

Fixed gratuitous ANSIisms.


# 29024 01-Sep-1997 bde

Added used #include - don't depend on <sys/mbuf.h> including
<sys/malloc.h> (unless we only use the bogusly shared M*WAIT flags).


# 24203 24-Mar-1997 bde

Don't include <sys/ioctl.h> in the kernel. Stage 1: don't include
it when it is not used. In most cases, the reasons for including it
went away when the special ioctl headers became self-sufficient.


# 23392 05-Mar-1997 julian

add a bunch of comments to describe what's going on.
This is some of the worst code I've had to wade through in
ages and I don't want to have to start from scratch again next time.

(I have a 2.2 version of these comments, can I commit them?)


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22010 25-Jan-1997 julian

fix mixleading comment (my error.. I wrote the comment)


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 18206 10-Sep-1996 julian

No code changes what so ever, but added about 150 lines of comments
Sorry if this makes it harder to merge in lite2 stuff but hey..
At least I can figure out what is going on whenever I end up going through those
files again..

do we have a policy regarding commenting existing code?


# 17997 02-Sep-1996 fenner

Bugfix and simplification for rev 1.34: make sure that the route
is non-null before trying to delete it in rt_setgate(), which then
allows removal of the special-case code from the RTM_ADD case.
This should fix the panics that joerg and Phil Karn have been seeing.


# 17802 24-Aug-1996 peter

route.c:RTM_ADD does not check for a netmask before doing a tree walk
like it does elsewhere. This is probably only happens when incorrect
args are given to route(8), or when running with non-IPv4 stacks but
incorrect args to the route command is no excuse for panicing!

Submitted by: Michael Clay <mclay@weareb.org>, PR#1532


# 17052 09-Jul-1996 fenner

Disallow host routes that point to themselves. These routes serve no
purpose, other than to get in the way of the ARP table and cause
"can't allocate llinfo" errors.

This change may cause gated or routed to start complaining when adding
such routes. If so, these programs will need to be fixed to not try
to add these routes.

Reviewed by: wollman


# 14904 29-Mar-1996 fenner

Eliminate panic("rtfree") caused by double-freeing the route
when rt == rt->rt_gwroute . rt == rt->gwroute shouldn't happen
in the first place, but that's another problem.

(try "route add -host <hostonmynet> <hostonmynet>; ping <hostonmynet>;
route delete <hostonmynet>")


# 14546 11-Mar-1996 dg

Move or add #include <queue.h> in preparation for upcoming struct socket
changes.


# 14328 02-Mar-1996 peter

Add more options into the conf/options and i386/conf/options.i386 files
and the #include hooks so that 'make depend' is more useful. This
covers most of the options I regularly use (but not all) and some other
easy ones.


# 13616 24-Jan-1996 wollman

Fix memory leak in case of adding a host route on top of another one.

Pointed-out-by: Bill Fenner <fenner@parc.xerox.com>


# 12820 14-Dec-1995 phk

Another mega commit to staticize things.


# 12578 02-Dec-1995 bde

Fixed call to mrt_ioctl(). mrt_ioctl() for some reason has different
number of args when MROUTING is defined.


# 11921 29-Oct-1995 phk

Second batch of cleanup changes.
This time mostly making a lot of things static and some unused
variables here and there.


# 11539 16-Oct-1995 wollman

When adding a route fails because there is already a route with the same
(mask,value) in the tree, don't immediately return EEXIST. Instead, check
to see if the pre-existing route was generated by protcol-cloning. If so,
then it is OK to simply blow away the old route and re-attempt the insertion.
If not, then fall back to the same error code as before.


# 9759 29-Jul-1995 bde

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuation.


# 9469 10-Jul-1995 wollman

When adding a route, set rt_ifa and rt_ifp a little earlier so that
the protocol-specific add routine can examine it if desired.


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 8070 25-Apr-1995 wollman

Finally finish the cloning cleanup work by making sure that clones
go away whenever a clone's parent is changed, or a route is added in a
certain set of circumstances.

This also includes code to forbid setting a route's gateway to an
address which can only be reached through that route, thus (hopefully)
eliminating one class of cloning bottomless-recursion bugs.


# 7335 24-Mar-1995 wollman

Don't delete clones if they are PINNED.


# 7279 23-Mar-1995 wollman

radix.c: correct exit condition in rn_walktree_from()
route.c: be a little more careful when running deleting children of dying
. routes


# 7224 21-Mar-1995 wollman

Protocol-cloned routes should gain a reference to their parents to make
sure that rt->rt_parent values can never be re-used harmfully.


# 7199 20-Mar-1995 dg

Made minor readability tweak.


# 7197 20-Mar-1995 wollman

Better fix for the deletion of parents of cloned routes problem,
superseding the `nextchild' hack. This also provides a way
forward to fix RTM_CHANGE and RTM_ADD as well.


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 5801 23-Jan-1995 dg

Added back the missing last few bytes of the file.


# 5791 23-Jan-1995 wollman

route.c: keep track of where cloned routes come from, and make sure to
delete them when the ``parent'' goes away

route.h: add glue to track this to rtentry structure. WARNING WILL ROBINSON!
This will be yet another incompatible change in your route-using binaries.
I apologize, but this was the only way to do it. I took this opportunity
to increase the size of the metrics to what I believe will be the final
length for 2.1, so that when the T/TCP stuff is done, this won't happen
again.


# 5104 13-Dec-1994 wollman

Implemented rtalloc_ign().


# 5099 13-Dec-1994 wollman

Add support for two separate cloning flags, one set by the lower layers,
and one set by the protocol family. Also add another parameter to
rtalloc1() to allow for any interface flags to be ignored; currently
this is only useful for RTF_PRCLONING. Get rid of rt_prflags and re-unite
with rt_flags. Add T/TCP ``route metrics''.

NB: YOU MUST RECOMPILE `route' AND OTHER RELATED PROGRAMS AS A RESULT OF
THIS CHANGE.

This also adds a new interface parameter, `ifi_physical', which will
eventually replace IFF_ALTPHYS as the mechanism for specifying the
particular physical connection desired on a multiple-connection card.

NB: YOU MUST RECOMPILE `ifconfig' AND OTHER RELATED PROGRAMS AS A RESULT OF
THIS CHANGE.


# 4104 02-Nov-1994 wollman

Collapse two fields so that we have space for another 32 flags.
NB: You will have to recompile programs which use the `rt_use' member in
order to get the correct values. This should not cause incorrect operation,
but the statistics may look a little confusing.


# 4073 02-Nov-1994 wollman

Add code to be a bit smarter about IP routes, conditioned on the option
IN_RMX. (Eventually this will be standard, but I just wrote the code today
and don't want to break anyone.)


# 3514 11-Oct-1994 wollman

Fix a bug which caused panics when attempting to change just the flags of
a route. (This still doesn't work, but it doesn't panic now.) It looks
like there may be a number of incipient bugs in this code.

Also, get ready for the time when all IP gateway routes are cloning, which
is necessary to keep proper TCP statistics.


# 3311 02-Oct-1994 phk

GCC cleanup.
Reviewed by:
Submitted by:
Obtained from:


# 2754 14-Sep-1994 wollman

Shuffle some functions and variables around to make it possible for
multicast routing to be implemented as an LKM. (There's still a bit of
work to do in this area.)


# 2554 07-Sep-1994 wollman

The mrt_ioctl goop properly depends on MROUTING, not MULTICAST. (Oof!)


# 2544 07-Sep-1994 se

Reviewed by: Stefan Esser
Submitted by:
rtioctl(): changed parameter to mrt_ioctl from "cmd" to "req" to make
it compile with MULTICAST defined.


# 2531 06-Sep-1994 wollman

Initial get-the-easy-case-working upgrade of the multicast code
to something more recent than the ancient 1.2 release contained in
4.4. This code has the following advantages as compared to
previous versions (culled from the README file for the SunOS release):

- True multicast delivery
- Configurable rate-limiting of forwarded multicast traffic on each
physical interface or tunnel, using a token-bucket limiter.
- Simplistic classification of packets for prioritized dropping.
- Administrative scoping of multicast address ranges.
- Faster detection of hosts leaving groups.
- Support for multicast traceroute (code not yet available).
- Support for RSVP, the Resource Reservation Protocol.

What still needs to be done:

- The multicast forwarder needs testing.
- The multicast routing daemon needs to be ported.
- Network interface drivers need to have the `#ifdef MULTICAST' goop ripped
out of them.
- The IGMP code should probably be bogon-tested.

Some notes about the porting process:

In some cases, the Berkeley people decided to incorporate functionality from
later releases of the multicast code, but then had to do things differently.
As a result, if you look at Deering's patches, and then look at
our code, it is not always obvious whether the patch even applies. Let
the reader beware.

I ran ip_mroute.c through several passes of `unifdef' to get rid of
useless grot, and to permanently enable the RSVP support, which we will
include as standard.

Ported by: Garrett Wollman
Submitted by: Steve Deering and Ajit Thyagarajan (among others)


# 1817 02-Aug-1994 dg

Added $Id$


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources