History log of /freebsd-current/sys/net/route.h
Revision Date Author Comments
# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixups to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 2ff63af9 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .h pattern

Remove /^\s*\*+\s*\$FreeBSD\$.*$\n/


# 90d62512 09-Mar-2023 Alexander V. Chernikov <melifaro@FreeBSD.org>

netlink: add rtsock-compatible header to use with netlink snl(3).

Some routing socket defines (`RTM_` and `RTA_` ones) clash with the ones
used by Netlink.
As some rtsock definitions like interface flags or route flags are used in
both netlink and rtsock, provide a convenient way to include those without
running into the define collision.

Differential Revision: https://reviews.freebsd.org/D38982
MFC after: 2 weeks


# 1bcd230f 03-Dec-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

netlink: add interface notification on link status / flags change.

* Add link-state change notifications by subscribing to ifnet_link_event.
In the Linux netlink model, link state is reported in 2 places: first is
the IFLA_OPERSTATE, which stores state per RFC2863.
The second is an IFF_LOWER_UP interface flag. As many applications rely
on the latter, reserve 1 bit from if_flags, named as IFF_NETLINK_1.
This flag is mapped to IFF_LOWER_UP in the netlink headers. This is done
to avoid making applications think this flag is actually
supported / presented in non-netlink outputs.
* Add flag change notifications, by hooking into rt_ifmsg().
In the netlink model, a notification should include the bitmask of the
changed flags. Update rt_ifmsg() to include such a bitmask.
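
A changed-flags bitmask of this kind is conventionally the XOR of the old and new flag words. The following is a minimal userland sketch under that assumption, not the rt_ifmsg() implementation; `flags_change_mask()`, `flags_need_notify()` and the `SK_IFF_*` constants are hypothetical names introduced only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bits; hypothetical stand-ins, not the kernel's IFF_* values. */
#define SK_IFF_UP       0x1u
#define SK_IFF_RUNNING  0x2u
#define SK_IFF_LOWER_UP 0x4u

/* Return a mask with one bit set for every flag whose value differs. */
static uint32_t
flags_change_mask(uint32_t old_flags, uint32_t new_flags)
{
	uint32_t changed = old_flags ^ new_flags;

	/* A changed bit must be set in at least one of the two words. */
	assert((changed & ~(old_flags | new_flags)) == 0);
	return (changed);
}

/* Return 1 if the transition changed any flag and is worth reporting. */
static int
flags_need_notify(uint32_t old_flags, uint32_t new_flags)
{
	return (flags_change_mask(old_flags, new_flags) != 0);
}
```

A listener can then test individual bits of the mask to decide which transitions it cares about.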

Differential Revision: https://reviews.freebsd.org/D37597


# 1922eb3e 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: retire pr_slowtimo and pr_fasttimo

They were useful many years ago, when the callwheel was not efficient,
and the kernel tried to have as few callout entries scheduled as
possible.

Reviewed by: tuexen, melifaro
Differential revision: https://reviews.freebsd.org/D36163


# 036f1bc6 14-Aug-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: retire rib_lookup_info()

This function was added in pre-epoch era ( 9a1b64d5a0224 ) to
provide public rtentry access interface & hide rtentry internals.
The implementation is based on the large on-stack copying and
refcounting of the referenced objects (ifa/ifp).
It has become obsolete after epoch & nexthop introduction. Convert
the last remaining user and remove the function itself.

Differential Revision: https://reviews.freebsd.org/D36197


# d8b42ddc 11-Aug-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

rtsock: subscribe to ifnet eventhandlers instead of direct calls.

Stop treating rtsock as a "special" consumer and use already-provided
ifaddr arrival/departure notifications.

MFC after: 2 weeks

Test Plan:
```
21:05 [0] m@devel0 route -n monitor

-> ifconfig vtnet0.2 create

got message of size 24 on Tue Aug 9 21:05:44 2022
RTM_IFANNOUNCE: interface arrival/departure: len 24, if# 3, what: arrival

got message of size 168 on Tue Aug 9 21:05:54 2022
RTM_IFINFO: iface status change: len 168, if# 3, link: up, flags:<BROADCAST,RUNNING,SIMPLEX,MULTICAST>

-> ifconfig vtnet0.2 destroy

got message of size 24 on Tue Aug 9 21:05:54 2022
RTM_IFANNOUNCE: interface arrival/departure: len 24, if# 3, what: departure

```

Reviewed By: glebius
Differential Revision: https://reviews.freebsd.org/D36095
MFC after: 2 weeks


# 66230639 03-Aug-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: split nexthop creation and rtentry creation.

This change is required for the upcoming introduction of the next
nexthop-based operations KPI, as it will create rtentry and nexthops
at different stages of route table modification.

Differential Revision: https://reviews.freebsd.org/D36072
MFC after: 2 weeks


# 23677398 02-Apr-2022 Gordon Bergling <gbe@FreeBSD.org>

net(3): Fix a typo in a source code comment

- s/paramenters/parameters/

MFC after: 3 days


# 63f7f392 26-Dec-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: Add unified level-based logging support for the routing subsystem.

MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D33664


# 62e1a437 22-Aug-2021 Zhenlei Huang <zlei.huang@gmail.com>

routing: Allow using IPv6 next-hops for IPv4 routes (RFC 5549).

Implement kernel support for RFC 5549/8950.

* Relax control plane restrictions and allow specifying IPv6 gateways
for IPv4 routes. This behavior is controlled by the
net.route.rib_route_ipv6_nexthop sysctl (on by default).

* Always pass final destination in ro->ro_dst in ip_forward().

* Use ro->ro_dst to extract packet family inside if_output() routines.
Consistently use RO_GET_FAMILY() macro to handle ro=NULL case.

* Pass extracted family to nd6_resolve() to get the LLE with proper encap.
It leverages recent lltable changes committed in c541bd368f86.

Presence of the functionality can be checked using ipv4_rfc5549_support feature(3).
Example usage:
route add -net 192.0.0.0/24 -inet6 fe80::5054:ff:fe14:e319%vtnet0

Differential Revision: https://reviews.freebsd.org/D30398
MFC after: 2 weeks


# 8e8f1cc9 23-Apr-2021 Mark Johnston <markj@FreeBSD.org>

Re-enable network ioctls in capability mode

This reverts a portion of 274579831b61 ("capsicum: Limit socket
operations in capability mode") as at least rtsol and dhcpcd rely on
being able to configure network interfaces while in capability mode.

Reported by: bapt, Greg V
Sponsored by: The FreeBSD Foundation


# 27457983 07-Apr-2021 Mark Johnston <markj@FreeBSD.org>

capsicum: Limit socket operations in capability mode

Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.

Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.

Reviewed by: oshogbo
Discussed with: emaste
Reported by: manu
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D29423


# b1d63265 08-Mar-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Flush remaining routes from the routing table during VNET shutdown.

Summary:
This fixes rtentry leak for the cloned interfaces created inside the
VNET.

PR: 253998
Reported by: rashey at superbox.pl
MFC after: 3 days

Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown).
Thus, any route table operations are too late to schedule.
As the intent of the vnet teardown procedures is to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`.
It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish.

Test Plan:
```
set_skip:set_skip_group_lo -> passed [0.053s]
tail -n 200 /var/log/messages | grep rtentry
```

Reviewers: #network, kp, bz

Reviewed By: kp

Subscribers: imp, ae

Differential Revision: https://reviews.freebsd.org/D29116


# 64d5c277 14-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove now-unused RTF_RNH_LOCKED route flag.

MFC after: 1 week


# 81728a53 08-Jan-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Split rtinit() into multiple functions.

rtinit[1]() is a function used to add or remove interface address prefix routes,
similar to ifa_maintain_loopback_route().
It was intended to be family-agnostic. In practice, this approach has
several problems.

1) IPv6 code does not use it for the ifa routes. There is a separate layer,
nd6_prelist_(), providing interface for maintaining interface routes. Its part,
responsible for the actual route table interaction, mimics rtinit() code.

2) rtinit tries to combine multiple actions in the same function: constructing
proper route attributes and handling iterations over multiple fibs, for the
non-zero net.add_addr_allfibs use case. It notably increases the code complexity.

3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag
for p2p connections, host routes and p2p routes are handled in the same way.
Additionally, mapping IFA flags to RTF flags makes the interface pretty messy.
It makes rtinit() clash with ifa_maintain_loopback_route() for IPv4 interface
aliases.

4) rtinit() is the last customer passing non-masked prefixes to rib_action(),
complicating rib_action() implementation.

5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive"
ifa messages in certain corner cases.

To address all these points, the following has been done:

* rtinit() has been split into multiple functions:
- Route attribute construction was moved to the per-address-family functions,
dealing with (2), (3) and (4).
- The function providing net.add_addr_allfibs handling and route rtsock notifications
is the new routing table interface.
- rtsock ifa notification has been moved out as well. The resulting set of functions is only
responsible for the actual route notifications.

Side effects:
* /32 alias does not result in interface routes (/32 route and "host" route)
* RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses

Differential revision: https://reviews.freebsd.org/D28186


# d68cf57b 07-Jan-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Refactor rt_addrmsg() and rt_routemsg().

Summary:
* Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally.
* Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer.
* Refactor rtsock_routemsg(): avoid accessing rtentry fields directly.
* Simplify in_addprefix() by moving prefix search to a separate function.

Reviewers: #network

Subscribers: imp, ae, bz

Differential Revision: https://reviews.freebsd.org/D28011


# 77df2c21 30-Nov-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Renumber NHR_* flags after NHR_IFAIF removal in r368127.

Suggested by: rpokala


# b712e3e3 29-Nov-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Refactor fib4/fib6 functions.

No functional changes.

* Make lookup part of fib<4|6>_lookup_debugnet() separate functions
(fib<4|6>_lookup_rt()). These will be used in the control plane code
requiring unlocked radix operations and actual prefix pointer.
* Make lookup part of fib<4|6>_check_urpf() separate functions.
This change simplifies the switch to alternative lookup implementations,
which helps algorithmic lookups introduction.
* While here, use static initializers for IPv4/IPv6 keys

Differential Revision: https://reviews.freebsd.org/D27405


# 7a6dc73c 28-Nov-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Cleanup nexthops request flags:
* remove NHR_IFAIF as it was used by previous version of nexthop KPI
* update NHR_REF description


# 7511a638 22-Nov-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Refactor rib iterator functions.

* Make rib_walk() order of arguments consistent with the rest of RIB api
* Add rib_walk_ext() allowing to exec callback before/after iteration.
* Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del
* Rename rt_foreach_fib_walk -> rib_foreach_table_walk
* Move rib_foreach_table_walk{_del} to route/route_helpers.c
* Slightly refactor rib_foreach_table_walk{_del} to make the implementation
consistent and prepare for upcoming iterator optimizations.

Differential Revision: https://reviews.freebsd.org/D27219


# 0c325f53 18-Oct-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Implement flowid calculation for outbound connections to balance
connections over multiple paths.

Multipath routing relies on mbuf flowid data for both transit
and outbound traffic. Current code fills mbuf flowid from inp_flowid
for connection-oriented sockets. However, inp_flowid is currently
not calculated for outbound connections.

This change creates simple hashing functions and starts calculating hashes
for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the
system.
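
The idea of hashing a connection's addressing tuple into a flowid and using it to pick one of several paths can be sketched in userland as follows. This is an illustration only: the log does not say which hash the kernel uses, so FNV-1a is swapped in here as a stand-in, and `struct flow_key`, `flow_hash()` and `flow_pick_path()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key for an IPv4 connection; not a kernel structure. */
struct flow_key {
	uint32_t src_addr;
	uint32_t dst_addr;
	uint16_t src_port;
	uint16_t dst_port;
	uint8_t  proto;
};

/* Fold one byte into a 32-bit FNV-1a state. */
static uint32_t
fnv1a_byte(uint32_t h, uint8_t b)
{
	return ((h ^ b) * 16777619u);
}

/* Fold a 32-bit value into the state, least significant byte first. */
static uint32_t
fnv1a_u32(uint32_t h, uint32_t v)
{
	for (int i = 0; i < 4; i++)
		h = fnv1a_byte(h, (uint8_t)(v >> (8 * i)));
	return (h);
}

/* Hash the whole tuple so that every packet of a flow gets the same id. */
static uint32_t
flow_hash(const struct flow_key *k)
{
	uint32_t h = 2166136261u;	/* FNV-1a 32-bit offset basis */

	assert(k != NULL);
	h = fnv1a_u32(h, k->src_addr);
	h = fnv1a_u32(h, k->dst_addr);
	h = fnv1a_u32(h, ((uint32_t)k->src_port << 16) | k->dst_port);
	h = fnv1a_byte(h, k->proto);
	return (h);
}

/* Map a flow onto one of npaths equal-cost paths. */
static unsigned
flow_pick_path(const struct flow_key *k, unsigned npaths)
{
	assert(npaths > 0);
	return (flow_hash(k) % npaths);
}
```

Because the hash depends only on the tuple, all packets of one connection follow the same path, while different connections spread across the available paths.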

Reviewed by: glebius (previous version),ae
Differential Revision: https://reviews.freebsd.org/D26523


# fedeb08b 03-Oct-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Introduce scalable route multipath.

This change is based on the nexthop objects landed in D24232.

The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
relative weights and a dataplane-optimized structure to enable
efficient nexthop selection.

Similar to the nexthops, nexthop groups are immutable. Dataplane part
gets compiled during group creation and is basically an array of
nexthop pointers, compiled w.r.t their weights.

With this change, `rt_nhop` field of `struct rtentry` contains either
nexthop or nexthop group. They are distinguished by the presence of
NHF_MULTIPATH flag.
All dataplane lookup functions return a pointer to the nexthop object,
leaving nexthop group details inside routing subsystem.
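
One way to picture "an array of nexthop pointers, compiled w.r.t. their weights" is the userland sketch below. It is an assumption-laden illustration, not the kernel code: the `sk_*` names and the 64-slot cap are hypothetical, and dividing weights by their greatest common divisor is merely one scheme that is consistent with the sample output in this message, where weights 10 and 20 occupy 1 and 2 slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_SLOTS 64		/* hypothetical cap for this sketch */

struct sk_nhop {		/* hypothetical stand-in for a nexthop */
	int      id;
	uint32_t weight;
};

struct sk_nhgrp {		/* hypothetical stand-in for a nexthop group */
	const struct sk_nhop *slots[SK_MAX_SLOTS];
	size_t nslots;
};

static uint32_t
sk_gcd(uint32_t a, uint32_t b)
{
	while (b != 0) {
		uint32_t t = a % b;
		a = b;
		b = t;
	}
	return (a);
}

/*
 * Build the slot array once, at group creation time.
 * Returns 0 on success, -1 on invalid input or if the cap is exceeded.
 */
static int
sk_nhgrp_compile(struct sk_nhgrp *grp, const struct sk_nhop *nh, size_t n)
{
	uint32_t g = 0;
	size_t total = 0;

	if (n == 0)
		return (-1);
	for (size_t i = 0; i < n; i++) {
		if (nh[i].weight == 0)
			return (-1);
		g = sk_gcd(g, nh[i].weight);
	}
	for (size_t i = 0; i < n; i++)
		total += nh[i].weight / g;
	if (total > SK_MAX_SLOTS)
		return (-1);
	grp->nslots = 0;
	for (size_t i = 0; i < n; i++)
		for (uint32_t c = 0; c < nh[i].weight / g; c++)
			grp->slots[grp->nslots++] = &nh[i];
	return (0);
}

/* Dataplane selection: a single array index derived from the flowid. */
static const struct sk_nhop *
sk_nhgrp_select(const struct sk_nhgrp *grp, uint32_t flowid)
{
	assert(grp->nslots > 0);
	return (grp->slots[flowid % grp->nslots]);
}
```

The design point this illustrates is that the weight arithmetic happens once in the control plane, so the per-packet cost is a modulo and an array load.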

User-visible changes:

The change is intended to be backward-compatible: all non-mpath operations
should work as before with ROUTE_MPATH and net.route.multipath=1.

All routes now come with a weight; the default weight is 1, the maximum is 2^24-1.

Current maximum multipath group width is statically set to 64.
This will become sysctl-tunable in the followup changes.

Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1

route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20

netstat -6On

Nexthop groups data

Internet6:
GrpIdx   NhIdx  Weight   Slots                                 Gateway     Netif Refcnt
     1 ------- ------- ------- --------------------------------------- ---------      1
            13      10       1                             2001:db8::2     vlan2
            14      20       2                             2001:db8::3     vlan2

Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default

Tested by: olivier
Reviewed by: glebius
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D26449


# 2259a030 21-Sep-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Rework part of routing code to reduce difference to D26449.

* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(),
as these two can be applied at different entities and at different times.
* Start filling route weight in route change notifications
* Pass flowid to UDP/raw IP route lookups
* Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact
that rtentry can contain multiple nexthops.

Differential Revision: https://reviews.freebsd.org/D26497


# f5247a23 21-Aug-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Make net.fibs growable.

Allow dynamically growing the number of fibs in each vnet.

This change alters current behavior. Currently, if one defines
ROUTETABLES > 1 in the kernel config, each vnet will be created
with the number of fibs defined in the kernel config.
After this commit vnets will be created with fibs=1.

Dynamic net.fibs is not compatible with net.add_addr_allfibs.
The plan is to deprecate the latter and make
net.add_addr_allfibs=0 default behaviour.

Reviewed by: glebius
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D26062


# 6cbadc42 13-Aug-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Move rtzone handling code to net/route_ctl.c

After moving the route control plane code from net/route.c,
all rtzone users ended up being in net/route_ctl.c.
Move uma(9) rtzone setup/teardown code to net/route_ctl.c as well
to have everything in a single place.

While here, remove custom initializers from the zone.
They were added originally to avoid setup/teardown of costly per-cpu counters.
With these counters removed, the only remaining job was avoiding rte mutex
setup/teardown. Mutex setup is relatively cheap. Additionally, this mutex
will soon be removed. With that in mind, there is no sense in keeping
custom zone callbacks.

Differential Revision: https://reviews.freebsd.org/D26051


# e1c05fd2 21-Jul-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Transition from rtrequest1_fib() to rib_action().

Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
in6_rtrequest, rtrequest_fib> and their uses and switch
to rib_action(). This is part of the new routing KPI.

Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546


# 72587123 19-Jul-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Temporarily revert r363319 to unbreak the build.

Reported by: CI
Pointy hat to: melifaro


# 8cee15d9 19-Jul-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Transition from rtrequest1_fib() to rib_action().

Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
in6_rtrequest, rtrequest_fib> and their uses and switch
to rib_action(). This is part of the new routing KPI.

Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546


# da187ddb 01-Jun-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Add rib_<add|del|change>_route() functions to manipulate the routing table.

The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
returned in case of successful operation result. There are two problems with
this approach. First is that it doesn't provide enough information for the
upcoming multipath changes, where rtentry refers to a new nexthop group,
and there is no way of guessing which paths were added during the change.
Second is that some rtentry fields can change during notification and
protecting from it by requiring customers to unlock rtentry is not desired.

Additionally, as the consumers such as rtsock do know which operation they
request in advance, making explicit add/change/del versions of the functions
makes sense, especially given the functions don't share a lot of code.

With that in mind, introduce rib_cmd_info notification structure and
rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
It will be used in upcoming generalized notifications.

* Move definitions of the new functions and some other functions/structures
used for the routing table manipulation to a separate header file,
net/route/route_ctl.h. net/route.h is a frequently used file included in
~140 places in kernel, and 90% of the users don't need these definitions.

Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D25067


# e7403d02 01-Jun-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Revert r361704, it accidentally committed merged D25067 and D25070.


# 79674562 01-Jun-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Add rib_<add|del|change>_route() functions to manipulate the routing table.

The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
returned in case of successful operation result. There are two problems with
this approach. First is that it doesn't provide enough information for the
upcoming multipath changes, where rtentry refers to a new nexthop group,
and there is no way of guessing which paths were added during the change.
Second is that some rtentry fields can change during notification and
protecting from it by requiring customers to unlock rtentry is not desired.

Additionally, as the consumers such as rtsock do know which operation they
request in advance, making explicit add/change/del versions of the functions
makes sense, especially given the functions don't share a lot of code.

With that in mind, introduce rib_cmd_info notification structure and
rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
It will be used in upcoming generalized notifications.

* Move definitions of the new functions and some other functions/structures
used for the routing table manipulation to a separate header file,
net/route/route_ctl.h. net/route.h is a frequently used file included in
~140 places in kernel, and 90% of the users don't need these definitions.

Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D25067


# 2bbab0af 23-May-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Use epoch(9) for rtentries to simplify control plane operations.

Currently the only reason for refcounting rtentries is the need to report
the rtable operation details immediately after the execution.
Delaying rtentry reclamation makes it possible to stop refcounting and simplify the code.
Additionally, this change allows reimplementing rib_lookup_info(), which
is used by some of the customers to get the matching prefix along
with nexthops, in a more efficient way.

The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to
nhop_priv to be able to reliably set curvnet even during vnet teardown.
Rest of the reference counting code will be removed in the D24867 .

Differential Revision: https://reviews.freebsd.org/D24866


# d2233725 10-May-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove rtalloc1(_fib) KPI.

Last user of rtalloc1() KPI has been eliminated in rS360631.
As kernel is now fully switched to use new routing KPI defined in
rS359823, remove old lookup functions.

Differential Revision: https://reviews.freebsd.org/D24776


# 656442a7 08-May-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Embed dst sockaddr into rtentry and remove rte packet counter

Currently each rtentry has dst&gateway allocated separately from another zone,
bloating cache accesses.

Current 'struct rtentry' has 12 "mandatory" radix pointers in the beginning,
leaving 4 usable pointers/32 bytes in the first 2 cache lines (amd64).
Fields needed for the datapath are destination sockaddr and rt_nhop.

So far there does not appear to be any routable addressing protocol, other
than IPv4/IPv6/MPLS, that uses keys longer than 20 bytes.
With that in mind, embed dst into struct rtentry, placing the first 24 bytes
of dst within the first 128 bytes of rtentry. That is enough to fit an IPv6
address within the first 128 bytes.

It is still pretty easy to add code for supporting separately-allocated dst,
however it doesn't make a lot of sense in having such code without a use case.

As rS359823 moved the gateway to the nexthop structure, the dst embedding change
removes the need for any additional allocations done by rt_setgate().

Lastly, as a part of cleanup, remove counter(9) allocation code, as this field
is not used in packet processing anymore.

Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24669


# 682b902d 07-May-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add rib_lookup() sockaddr lookup wrapper and make ifa_ifwithroute use it.

Create rib_lookup() wrapper around per-af dataplane lookup functions.
This will help in the cases of having control plane af-agnostic code.

Switch ifa_ifwithroute() to use this function instead of rtalloc1().

Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24731


# fc88ecd3 28-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Move route-specific ddb commands to route/route_ddb.c

Currently functionality resides in rtsock.c, which is a controlling
interface, partially external to the routing subsystem.
Additionally, DDB-supporting functionality is > 100SLOC, which deserves
a separate file.

Given that, move this functionality to a newly-created net/route/ subdir.

Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D24561


# fe6da727 28-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Move struct rtentry definition to nhop_var.h.

One of the goals of the new routing KPI defined in r359823
is to entirely hide `struct rtentry` from the consumers.
It will allow to improve routing subsystem internals and deliver
features much faster.

This is one of the last changes, effectively moving struct rtentry
definition to a net/route_var.h header, internal to the routing subsystem.

Differential Revision: https://reviews.freebsd.org/D24580


# 1b0051ba 28-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Eliminate now-unused parts of old routing KPI.

r360292 switched most of the remaining routing customers to a new KPI,
leaving a bunch of wrappers for old routing lookup functions unused.

Remove them from the tree as a part of routing cleanup.

Differential Revision: https://reviews.freebsd.org/D24569


# 57e70e44 25-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix userland build broken by r360292.


# 983066f0 25-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert route caching to nexthop caching.

This change is built on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
to perform packet forwarding such as gateway interface and mtu. Nexthops
are shared among the routes, providing more pre-computed cache-efficient
data while requiring less memory. Splitting the LPM code and the attached
data solves multiple long-standing problems in the routing layer,
drastically reduces the coupling with other parts of the stack and allows
to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
for notably better performance for large TCP senders. Caching works by
acquiring rtentry reference, which is protected by per-rtentry mutex.
If the routing table is changed (checked by comparing the rtable generation id)
or link goes down, cache record gets withdrawn.
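
The generation-id check described above can be sketched in userland as follows. This is only an illustration of the technique: the `sk_*` names are hypothetical, and the reference counting and link-state handling of the real code are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical routing table: the generation is bumped on every change. */
struct sk_rtable {
	uint32_t gen;
};

/* Hypothetical per-connection cache record. */
struct sk_route_cache {
	const void *entry;	/* cached object, NULL if empty */
	uint32_t    gen;	/* table generation at caching time */
};

/* Remember a lookup result together with the current table generation. */
static void
sk_cache_store(struct sk_route_cache *c, const struct sk_rtable *t,
    const void *entry)
{
	assert(entry != NULL);
	c->entry = entry;
	c->gen = t->gen;
}

/* Return the cached object, or NULL after withdrawing a stale record. */
static const void *
sk_cache_get(struct sk_route_cache *c, const struct sk_rtable *t)
{
	if (c->entry != NULL && c->gen != t->gen)
		c->entry = NULL;	/* table changed: withdraw */
	return (c->entry);
}

/* Any table modification invalidates every outstanding cache record. */
static void
sk_rtable_changed(struct sk_rtable *t)
{
	t->gen++;
}
```

A caller that gets NULL back performs a full lookup and stores the fresh result, so the slow path runs only after the table has actually changed.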

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nexthop as a
cached object, which is mostly mechanical. Other moving parts like cache
cleanup on rtable change remain the same.

Differential Revision: https://reviews.freebsd.org/D24340


# 5cdc4844 16-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix userland build broken by r360014.


# 539642a2 16-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add nhop parameter to rti_filter callback.

One of the goals of the new routing KPI defined in r359823 is to
entirely hide `struct rtentry` from the consumers. It will allow to
improve routing subsystem internals and deliver more features much faster.
This change is one of the ongoing changes to eliminate direct
struct rtentry field accesses.

Additionally, with the followup multipath changes, single rtentry can point
to multiple nexthops.

With that in mind, convert rti_filter callback used when traversing the
routing table to accept pair (rt, nhop) instead of nexthop.

Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24440


# dd4776f0 14-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Reorganise nd6 notification code to avoid direct rtentry field access.

One of the goals of the new routing KPI defined in r359823 is to entirely hide
`struct rtentry` from the consumers. Doing so will allow to improve routing
subsystem internals and deliver features more easily. This change is one of
the ongoing changes to eliminate direct struct rtentry field accesses.

It introduces rtfree_func() wrapper around RTFREE() and reorganises nd6 notification
code to avoid accessing most of the rtentry fields.

Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24404


# a6663252 12-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Introduce nexthop objects and new routing KPI.

This is the foundational change for the routing subsystem rearchitecture.
More details and goals are available in https://reviews.freebsd.org/D24141 .

This patch introduces the concept of nexthop objects and a new nexthop-based
routing KPI.

Nexthops are objects, containing all necessary information for performing
the packet output decision. Output interface, mtu, flags, gw address go
there. For most of the cases, these objects will serve the same role as
the struct rtentry is currently serving.
Typically there will be low tens of such objects for the router even with
multiple BGP full-views, as these objects will be shared between routing
entries. This allows to store more information in the nexthop.

New KPI:

struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst,
uint32_t scopeid, uint32_t flags, uint32_t flowid);
struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6,
uint32_t scopeid, uint32_t flags, uint32_t flowid);

These 2 functions are intended to replace all flavours of
<in_|in6_>rtalloc[1]<_ign><_fib>, mpath functions and the previous
fib[46]-generation functions.

Upon successful lookup, they return nexthop object which is guaranteed to
exist within current NET_EPOCH. If longer lifetime is desired, one can
specify NHR_REF as a flag and get a referenced version of the nexthop.
Reference semantic closely resembles rtentry one, allowing sed-style conversion.

Additionally, another 2 functions are introduced to support uRPF functionality
inside variety of our firewalls. Their primary goal is to hide the multipath
implementation details inside the routing subsystem, greatly simplifying
firewalls implementation:

int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid,
uint32_t flags, const struct ifnet *src_if);
int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid,
uint32_t flags, const struct ifnet *src_if);

All functions have a separate scopeid argument, paving way to eliminating IPv6 scope
embedding and allowing to support IPv4 link-locals in the future.

Structure changes:
* rtentry gets new 'rt_nhop' pointer, slightly growing the overall size.
* rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz.

Old KPI:
During the transition state old and new KPI will coexist. As there are another 4-5
decent-sized conversion patches, it will probably take a couple of weeks.
To support both KPIs, fields not required by the new KPI (most of rtentry) have to be
kept, resulting in the temporary size increase.
Once conversion is finished, rtentry will notably shrink.

More details:
* architectural overview: https://reviews.freebsd.org/D24141
* list of the next changes: https://reviews.freebsd.org/D24232

Reviewed by: ae,glebius(initial version)
Differential Revision: https://reviews.freebsd.org/D24232


# 34a5582c 22-Jan-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Bring back redirect route expiration.

Redirect (and temporal) route expiration was broken a while ago.
This change brings route expiration back, with unified IPv4/IPv6 handling code.

It introduces net.inet.icmp.redirtimeout sysctl, allowing to set
an expiration time for redirected routes. It defaults to 10 minutes,
analogous to net.inet6.icmp6.redirtimeout.

Implementation uses separate file, route_temporal.c, as route.c is already
bloated with tons of different functions.
Internally, expiration is implemented as a per-rnh callout scheduled when
route with non-zero rt_expire time is added or rt_expire is changed.
It does not add any overhead when no temporal routes are present.

Callout traverses entire routing tree under wlock, scheduling expired routes
for deletion and calculating the next time it needs to be run. The rationale
for such an implementation is the following: typically workloads requiring a large
number of routes have redirects turned off already, while the systems with
a small number of routes will not incur large overhead during tree traversal.
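
The callout's job, as described above, is a single pass that marks expired routes for deletion and computes when it needs to run next. A minimal userland sketch over a flat array (the real code walks the routing tree under its write lock; `struct sk_route` and `sk_expire_walk()` are hypothetical names):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical route record for this sketch. */
struct sk_route {
	uint64_t expire;	/* absolute expiry time; 0 means never */
	int      deleted;	/* set once scheduled for deletion */
};

/*
 * Walk all routes once. Routes whose expiry time has passed are marked
 * for deletion. The return value is the earliest remaining expiry time,
 * or 0 if no temporal routes are left and no further run is needed.
 */
static uint64_t
sk_expire_walk(struct sk_route *rt, size_t n, uint64_t now)
{
	uint64_t next = 0;

	assert(rt != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (rt[i].deleted || rt[i].expire == 0)
			continue;	/* permanent or already handled */
		if (rt[i].expire <= now) {
			rt[i].deleted = 1;
			continue;
		}
		if (next == 0 || rt[i].expire < next)
			next = rt[i].expire;
	}
	return (next);
}
```

Returning 0 when nothing temporal remains mirrors the point made above: with no temporal routes present, nothing is rescheduled and no overhead is added.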

This change also fixes netstat -rn display of route expiration time, which
has been broken since the conversion from kread() to sysctl.

Reviewed by: bz
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D23075


# ead85fe4 09-Jan-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add fibnum, family and vnet pointer to each rib head.

Having metadata such as fibnum or vnet in the struct rib_head
is handy as it eases building functionality in the routing space.
This change is required to properly bring back route redirect support.

Reviewed by: bz
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D23047


# e02d3fe7 07-Jan-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix rtsock route message generation for interface addresses.

Reviewed by: olivier
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D22974


# c83dda36 31-Dec-2019 Alexander V. Chernikov <melifaro@FreeBSD.org>

Split gigantic rtsock route_output() into smaller functions.

Amount of changes to the original code has been intentionally minimised
to ease diffing.
The changes are mostly mechanical, with the following exceptions:

* lltable handler is now called directly based on RTF_LLINFO flag presence.
* "report" logic for updating rtm in RTM_GET/RTM_DELETE has been simplified,
fixing several potential use-after-free cases in rt_addrinfo.
* lltable asserts have been replaced with error returns, preventing kernel
crashes when lltable gw af family is invalid (root required).

MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22864


# 185c3d2b 16-Dec-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Convert routing statistics to VNET_PCPUSTAT.

Submitted by: ocochard
Reviewed by: melifaro, glebius
Differential Revision: https://reviews.freebsd.org/D22834


# fda45409 18-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Make rt_getifa_fib() static.


# 6ca363eb 08-May-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Existence of PCB route caching doesn't allow us to use new fast route
lookup KPI in ip_output() like it is already used in ip_forward().
However, when there is no PCB provided we can use fast KPI, gaining
performance advantage.

Typical case when ip_output() is called without a PCB pointer is a
sendto(2) on a not connected UDP socket. In practice DNS servers do
this.

Reviewed by: melifaro
Differential Revision: https://reviews.freebsd.org/D19804


# d25f8522 26-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Plug routing sysctl leaks.

Various structures exported by sysctl_rtsock() contain padding fields
which were not being zeroed.
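
The class of fix described above can be illustrated with a small userland sketch. The demo_export structure and demo_fill_export() are hypothetical and are not the structures exported by sysctl_rtsock(); the point is only that the whole object is zeroed before individual fields are assigned, so padding never carries stale memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical export record; most ABIs insert padding after 'family'. */
struct demo_export {
	uint8_t		family;
	uint32_t	mtu;
};

static void
demo_fill_export(struct demo_export *out, uint8_t family, uint32_t mtu)
{
	/* Zero the whole object first so padding bytes hold no stale data. */
	memset(out, 0, sizeof(*out));
	out->family = family;
	out->mtu = mtu;
}
```

Assigning the members alone would leave the bytes between 'family' and 'mtu' untouched, which is how uninitialized memory can leak when such a record is copied out.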

Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: ae
MFC after: 3 days
Security: kernel memory disclosure
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18333


# d4672c61 03-Sep-2018 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than duplicating the functionality of a macro after r322866
use the already existing one. No functional changes.

Reviewed by: karels, ae
Approved by: re (rgrimes)
Differential Revision: https://reviews.freebsd.org/D17004


# fc21c53f 22-Jan-2018 Ryan Stone <rstone@FreeBSD.org>

Reduce code duplication for inpcb route caching

Add a new macro to clear both the L3 and L2 route caches, to
hopefully prevent future instances where only the L3 cache was
cleared when both should have been.
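
A minimal sketch of the idea, assuming a simplified cache holder: demo_rtcache and DEMO_RO_INVALIDATE_CACHE are hypothetical names, and the reference release a real kernel macro would have to perform is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified cache holder; not the real struct route layout. */
struct demo_rtcache {
	void	*l3_route;	/* cached routing lookup result */
	void	*l2_entry;	/* cached link-layer (ARP/ND) entry */
};

/* Clear both caches together so callers cannot forget the L2 half. */
#define	DEMO_RO_INVALIDATE_CACHE(ro) do {	\
	(ro)->l3_route = NULL;			\
	(ro)->l2_entry = NULL;			\
} while (0)
```

Funnelling every invalidation through one macro means the two pointers cannot drift apart at individual call sites.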

MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13989
Reviewed by: karels


# dbeab32f 22-Jan-2018 Ryan Stone <rstone@FreeBSD.org>

Invalidate inpcb LLE cache if cached route is invalidated

When the inpcb route cache is invalidated after a change to the
routing tables, we need to invalidate the LLE cache as well.
Previous to this change packets for the connection would continue
to use the old L2 information from the old L3 gateway, and the
packets for the connection would likely be blackholed.

MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13988
Reviewed by: karels


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# dcfa556b0 24-Aug-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Garbage collect RT_NORTREF, which is no longer in use after FLOWTABLE removal.


# b83aa367 13-Jun-2017 Andrey V. Elsukov <ae@FreeBSD.org>

Resurrect RTF_RNH_LOCKED flag and restore ability to call rtalloc1_fib()
with acquired RIB lock.

This fixes a possible panic due to trying to acquire RIB rlock when it is
already exclusive locked.

PR: 215963, 215122
MFC after: 1 week
Sponsored by: Yandex LLC


# 2f8c6c0a 16-Apr-2017 Patrick Kelsey <pkelsey@FreeBSD.org>

Fix userland tools that don't check the format of routing socket
messages before accessing message fields that may not be present,
removing dead/duplicate/misleading code along the way.

Document the message format for each routing socket message in
route.h.

Fix a bug in usr.bin/netstat introduced in r287351 that resulted in
pointer computation with essentially random 16-bit offsets and
dereferencing of the results.

Reviewed by: ae
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D10330


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber clause 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistence on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# 10addc1e 10-Feb-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Last consumer of _WANT_RTENTRY gone.


# 9809e9dc 01-Aug-2016 Conrad Meyer <cem@FreeBSD.org>

rtentry: Initialize rt_mtx with MTX_NEW

The "rtentry" zone does not use UMA_ZONE_ZINIT, so it is invalid to assume the
mutex's memory will be zero. Without MTX_NEW, garbage backing memory may
trigger the "re-initializing a mutex" assertion.

PR: 200991
Submitted by: Chang-Hsien Tsai <luke.tw AT gmail.com>


# 80ae8d60 05-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Provide a public interface to rt_flushifroutes which takes the address
family as an argument as well.
This will be used to cleanup individual protocols during VNET teardown.

Obtained from: projects/vnet
Sponsored by: The FreeBSD Foundation


# 6d768226 02-Jun-2016 George V. Neville-Neil <gnn@FreeBSD.org>

This change re-adds L2 caching for TCP and UDP, as originally added in D4306
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.

Submitted by: Mike Karels
Differential Revision: https://reviews.freebsd.org/D6262


# 4f321dbd 24-Mar-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix compile errors after r297225:

- properly V_irtualise variable access unbreaking VIMAGE kernels.
- remove the volatile from the function return type to make architecture
using gcc happy [-Wreturn-type]
"type qualifiers ignored on function return type"
I am not entirely happy with this solution putting the u_int there
but it will do for now.


# 84cc0778 24-Mar-2016 George V. Neville-Neil <gnn@FreeBSD.org>

FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.

Submitted by: Mike Karels
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4306


# 61eee0e2 24-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

MFP r287070,r287073: split radix implementation and route table structure.

There are a number of radix consumers in kernel land (pf,ipfw,nfs,route)
with different requirements. In fact, the first 3 don't have _any_ requirements
and the first 2 do not use radix locking. On the other hand, routing
structures do have these requirements (rnh_gen, multipath, custom
to-be-added control plane functions, different locking).
Additionally, radix should not know anything about its consumers' internals.

So, radix code now uses tiny 'struct radix_head' structure along with
internal 'struct radix_mask_head' instead of 'struct radix_node_head'.
Existing consumers still use the same 'struct radix_node_head' with
slight modifications: they need to pass pointer to (embedded)
'struct radix_head' to all radix callbacks.

Routing code now uses new 'struct rib_head' with different locking macro:
RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing
information base).

New net/route_var.h header was added to hold routing subsystem internal
data. 'struct rib_head' was placed there. 'struct rtentry' will also
be moved there soon.


# 10e0e235 14-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove now-unused wrappers for various routing functions.


# 0eb64f4e 13-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove RTF_RNH_LOCKED support from rtalloc1_fib().

Last caller using it was eliminated in r293471.

Sponsored by: Yandex LLC


# e5f3746a 11-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Do not rewrite all ro_flags.


# 64e94934 09-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix userland build broken by r293470.

Pointy hat to: melifaro


# 36402a68 09-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Finish r275196: do not dereference rtentry in if_output() routines.

The only piece of information that is required is rt_flags subset.

In particular, if_loop() requires RTF_REJECT and RTF_BLACKHOLE flags
to check if this particular mbuf needs to be dropped (and what
error should be returned).
Note that if_loop() will always return EHOSTUNREACH for "reject" routes
regardless of RTF_HOST flag existence. This is due to upcoming routing
changes where RTF_HOST value won't be available as lookup result.

All other functions require RTF_GATEWAY flag to check if they need
to return EHOSTUNREACH instead of EHOSTDOWN error.

There are 11 places where non-zero 'struct route' is passed to if_output().
Most of the callers (forwarding, bpf, arp) do not care about the exact
error value. In fact, the only place where this result is propagated
is ip_output(). (ip6_output() passes NULL route to nd6_output_ifp()).

Given that, add 3 new 'struct route' flags (RT_REJECT, RT_BLACKHOLE and
RT_IS_GW) and inline function (rt_update_ro_flags()) to copy necessary
rte flags to ro_flags. Call this function in ip_output() after looking up/
verifying rte.
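
The flag mirroring described above might look roughly like the following. This is a sketch, not the committed rt_update_ro_flags(): the DEMO_ prefixed names and all bit values are illustrative assumptions, and the function takes and returns plain integers instead of touching a struct route.

```c
#include <assert.h>

/* Illustrative bit values only; the real definitions live in net/route.h. */
#define	DEMO_RTF_GATEWAY	0x0002
#define	DEMO_RTF_REJECT		0x0008
#define	DEMO_RTF_BLACKHOLE	0x1000

#define	DEMO_RT_REJECT		0x0020
#define	DEMO_RT_BLACKHOLE	0x0040
#define	DEMO_RT_IS_GW		0x0080

/* Mirror the three interesting rte flags into the route's own flag word. */
static int
demo_update_ro_flags(int ro_flags, int rt_flags)
{
	/* Clear only the mirrored bits; unrelated ro_flags stay intact. */
	ro_flags &= ~(DEMO_RT_REJECT | DEMO_RT_BLACKHOLE | DEMO_RT_IS_GW);

	if (rt_flags & DEMO_RTF_REJECT)
		ro_flags |= DEMO_RT_REJECT;
	if (rt_flags & DEMO_RTF_BLACKHOLE)
		ro_flags |= DEMO_RT_BLACKHOLE;
	if (rt_flags & DEMO_RTF_GATEWAY)
		ro_flags |= DEMO_RT_IS_GW;
	return (ro_flags);
}
```

With the flags copied this way, an output routine only needs the small flag word to choose between dropping the packet and picking EHOSTUNREACH or EHOSTDOWN; it never has to dereference the route entry.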

Reviewed by: ae


# ea8d1492 09-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove sys/eventhandler.h from net/route.h

Reviewed by: ae


# f2b2e77a 08-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

(Temporarily) remove route_redirect_event eventhandler.

Such handler should pass different set of variables, instead
of directly providing 2 locked route entries.
Given that it hasn't been really used since at least 2012, remove
current code.
Will re-add it after finishing most major routing-related changes.

Discussed with: np


# 9a1b64d5 04-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add rib_lookup_info() to provide API for retrieving individual route
entries data in unified format.

There are control plane functions that require information other than
just next-hop data (e.g. individual rtentry fields like flags or
prefix/mask). Given that the goal is to avoid rte reference/refcounting,
re-use rt_addrinfo structure to store most rte fields. If caller wants
to retrieve key/mask or gateway (which are sockaddrs and are allocated
separately), it needs to provide sufficient-sized sockaddrs structures
w/ their pointers saved in the passed rt_addrinfo.

Convert:
* lltable new records checks (in_lltable_rtcheck(),
nd6_is_new_addr_neighbor()).
* rtsock pre-add/change route check.
* IPv6 NS ND-proxy check (RADIX_MPATH code was eliminated because
1) we don't support RTF_ANNOUNCE ND-proxy for networks and there should
not be multiple host routes for such hosts 2) if we have multiple
routes we should inspect them (which is not done). 3) the entire idea
of abusing KRT as storage for ND proxy seems odd. Userland programs
should be used for that purpose).


# 0d4df029 03-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Handle IPV6_PATHMTU option by splitting ip6_getpmtu_ctl() from ip6_getpmtu().
Add ro_mtu field to 'struct route' to be able to pass lookup MTU back to
the caller.

Currently, ip6_getpmtu() has 2 totally different use cases:
1) control plane (IPV6_PATHMTU req), where we just need to calculate MTU
and return it, w/o any reusability.
2) Actual ip6_output() data path where we (nearly) always use the provided
route lookup data. If this data is not 'valid' we need to perform another
lookup and save the result (which cannot be re-used by ip6_output()).

Given that, handle 1) by calling separate function doing rte lookup itself.
Resulting MTU is calculated by (newly-added) ip6_calcmtu() used by both
ip6_getpmtu_ctl() and ip6_getpmtu().
For 2) instead of storing ref'ed rte, store mtu (the only needed data
from the lookup result) inside newly-added ro_mtu field.
'struct route' was shrunk by 8 (or 4) bytes in r292978. Grow it again
by 4 bytes. New ro_mtu field will be used in other places like
ip/tcp_output (EMSGSIZE handling from output routines).

Reviewed by: ae


# 4fb3a820 30-Dec-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Implement interface link header precomputation API.

Add if_requestencap() interface method which is capable of calculating
various link headers for given interface. Right now there is support
for INET/INET6/ARP llheader calculation (IFENCAP_LL type request).
Other types are planned to support more complex calculation
(L2 multipath lagg nexthops, tunnel encap nexthops, etc..).

Reshape 'struct route' to be able to pass additional data (with its length)
to prepend to mbuf.

These two changes permit routing code to pass pre-calculated nexthop data
(like L2 header for route w/gateway) down to the stack eliminating the
need for other lookups. It also brings us closer to more complex scenarios
like transparently handling MPLS nexthops and tunnel interfaces.
Last, but not least, it removes layering violation introduced by flowtable
code (ro_lle) and simplifies handling of existing if_output consumers.

ARP/ND changes:
Make arp/ndp stack pre-calculate link header upon installing/updating lle
record. Interface link address changes are handled by re-calculating
headers for all lles based on if_lladdr event. After these changes,
arpresolve()/nd6_resolve() returns full pre-calculated header for
supported interfaces thus simplifying if_output().
Move these lookups to a separate ether_resolve_addr() function which either
returns an error or a fully-prepared link header. Add <arp|nd6_>resolve_addr()
compat versions to return link addresses instead of pre-calculated data.

BPF changes:
Raw bpf writes occupied _two_ cases: AF_UNSPEC and pseudo_AF_HDRCMPLT.
Despite the naming, both of them have their header "complete". The only
difference is that interface source mac has to be filled by OS for
AF_UNSPEC (controlled via BIOCGHDRCMPLT). This logic has to stay inside
BPF and not pollute if_output() routines. Convert BPF to pass prepend data
via new 'struct route' mechanism. Note that it does not change
non-optimized if_output(): ro_prepend handling is purely optional.
Side note: hackish pseudo_AF_HDRCMPLT is supported for ethernet and FDDI.
It is not needed for ethernet anymore. The only remaining FDDI user is
dev/pdq mostly untouched since 2007. FDDI support was eliminated from
OpenBSD in 2013 (sys/net/if_fddisubr.c rev 1.65).

Flowtable changes:
Flowtable violates layering by saving (and not correctly managing)
rtes/lles. Instead of passing lle pointer, pass pointer to pre-calculated
header data from that lle.

Differential Revision: https://reviews.freebsd.org/D4102


# 427c2f4e 16-Dec-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Provide additional lle data in IPv6 lltable dump used by ndp(8).

Before the change, things like lle state were queried via
SIOCGNBRINFO_IN6 by ndp(8) for _each_ lle entry in dump.
This ioctl was added in 1999, probably to avoid touching rtsock code.

This change maps SIOCGNBRINFO_IN6 data to standard rtsock dump the
following way:
expire (already) maps to rtm_rmx.rmx_expire
isrouter -> rtm_flags & RTF_GATEWAY
asked -> rtm_rmx.rmx_pksent
state -> rtm_rmx.rmx_state (maps to rmx_weight via define)

Reviewed by: ae


# 65ff3638 08-Dec-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Merge helper fib* functions used for basic lookups.

Vast majority of rtalloc(9) users require only basic info from
route table (e.g. "does the rtentry interface match with the interface
I have?". "what is the MTU?", "Give me the IPv4 source address to use",
etc..).
Instead of hand-rolling lookups, checking if rtentry is up, valid,
dealing with IPv6 mtu, finding "address" ifp (almost never done right),
provide easy-to-use API hiding all the complexity and returning the
needed info into small on-stack structure.

This change also helps hiding route subsystem internals (locking, direct
rtentry accesses).
Additionally, using this API improves lookup performance since rtentry is not
locked.
(This is safe, since all the rtentry changes happen under both radix WLOCK
and rtentry WLOCK).

Sponsored by: Yandex LLC


# e8b0643e 29-Nov-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add new rt_foreach_fib_walk_del() function for deleting route entries
by filter function instead of picking into routing table details in
each consumer.
Remove now-unused rt_expunge() (eliminating last external RTF_RNH_LOCKED
user).
This simplifies future nexthops/multipath changes and rtrequest1_fib()
locking refactoring.

Actual changes:
Add "rt_chain" field to permit rte grouping while doing batched delete
from routing table (thus growing rte 200->208 on amd64).
Add "rti_filter" / "rti_filterdata" / "rti_spare" fields to rt_addrinfo
to pass filter function to various routing subsystems in standard way.
Convert all rt_expunge() consumers to new rt_addrinfo-based api and eliminate
rt_expunge().
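
The filter-plus-chain pattern described above can be sketched with a plain linked list. Everything here is an assumption for illustration: demo_rte, demo_walk_del() and demo_match_ifindex() are hypothetical, a singly linked list replaces the radix tree, and the 'chain' member only plays the role the text assigns to "rt_chain".

```c
#include <assert.h>
#include <stddef.h>

struct demo_rte {
	int		 ifindex;	/* hypothetical attribute to filter on */
	struct demo_rte	*next;		/* table linkage */
	struct demo_rte	*chain;		/* batch linkage, like "rt_chain" */
};

typedef int demo_filter_t(const struct demo_rte *, void *);

/*
 * Unlink every entry accepted by the filter and return the unlinked
 * entries as a chain, so the caller can free them after dropping locks.
 */
static struct demo_rte *
demo_walk_del(struct demo_rte **head, demo_filter_t *filter, void *arg)
{
	struct demo_rte *doomed = NULL;
	struct demo_rte **pp = head;

	while (*pp != NULL) {
		struct demo_rte *rt = *pp;

		if (filter(rt, arg)) {
			*pp = rt->next;		/* unlink from the table */
			rt->next = NULL;
			rt->chain = doomed;	/* push onto the batch */
			doomed = rt;
		} else
			pp = &rt->next;
	}
	return (doomed);
}

/* Example filter: match routes that use a given interface index. */
static int
demo_match_ifindex(const struct demo_rte *rt, void *arg)
{
	return (rt->ifindex == *(const int *)arg);
}
```

The consumer supplies only the predicate; the walk, unlinking and batching stay in one place, which is the separation the commit message argues for.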


# f221bcaa 17-Oct-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove several compat functions from pre-fib era.


# 441f9243 04-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Constantify lookup key in ifa_ifwith* functions.
Some places in our network stack already have const
arguments (like if_output() routines and LLE functions).

Code using ifa_ifwith (and similar functions) along with
LLE/_output functions is currently bound to use tricks
like __DECONST(). Provide a cleaner way by making sockaddr
lookup key really constant.

MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3464


# 2caee4be 10-Aug-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Rename rt_foreach_fib() to rt_foreach_fib_walk().

Suggested by: julian


# 4bdf0b6a 08-Aug-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

MFP r274295:

* Move interface route cleanup to route.c:rt_flushifroutes()
* Convert most of "for (fibnum = 0; fibnum < rt_numfibs; fibnum++)" users
to use new rt_foreach_fib() instead of hand-rolling cycles.
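
The conversion above replaces open-coded loops with a callback iterator. A minimal sketch of that shape follows; demo_foreach_fib() and its signature are hypothetical and do not reproduce the real rt_foreach_fib() prototype.

```c
#include <assert.h>

typedef void demo_fib_cb_t(unsigned int fibnum, int af, void *arg);

/* Call 'cb' once per FIB so callers stop open-coding the loop. */
static void
demo_foreach_fib(unsigned int numfibs, int af, demo_fib_cb_t *cb, void *arg)
{
	for (unsigned int fibnum = 0; fibnum < numfibs; fibnum++)
		cb(fibnum, af, arg);
}

/* Example callback: count how many FIBs were visited. */
static void
demo_count_cb(unsigned int fibnum, int af, void *arg)
{
	(void)fibnum;
	(void)af;
	(*(unsigned int *)arg)++;
}
```

Keeping the loop bound inside the iterator means consumers no longer need to know how many FIBs exist.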


# 5d14e4cd 29-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Provide rte_<get|set> methods to access rtentry for external consumers.


# 1be1588a 29-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Make ifa_add_loopback_route() prepare gw before insertion.
* Temporarily move ifa_switch_loopback_route() implementation to route.c


# 7f948f12 16-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Finish r274175: do control plane MTU tracking.

Update route MTU in case of ifnet MTU change.
Add new RTF_FIXEDMTU to track explicitly specified MTU.

Old behavior:
ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU.
User has to manually update all routes.
ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU.
However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu
gets updated.

New behavior:
ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs get updated
with new MTU unless RTF_FIXEDMTU flag set on them.
ifconfig em0 mtu 9000->1500 -> all routes in all fibs get updated with new
MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu.

route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag.
route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag.
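
The "New behavior" rules can be restated as a small decision function. This is a sketch under assumptions: demo_route_mtu() and the DEMO_RTF_FIXEDMTU bit value are hypothetical, the function ignores the distinction between interface routes and other routes, and it does not model whether the flag itself is kept or cleared.

```c
#include <assert.h>
#include <stdint.h>

#define	DEMO_RTF_FIXEDMTU	0x1	/* illustrative bit value */

/*
 * Return the MTU a route should carry after the interface MTU changes
 * from old_ifmtu to new_ifmtu, following the rules stated above.
 */
static uint32_t
demo_route_mtu(uint32_t rt_mtu, int rt_flags, uint32_t old_ifmtu,
    uint32_t new_ifmtu)
{
	int fixed = (rt_flags & DEMO_RTF_FIXEDMTU) != 0;

	if (new_ifmtu >= old_ifmtu) {
		/* Growing: follow the interface unless the MTU was pinned. */
		return (fixed ? rt_mtu : new_ifmtu);
	}
	/* Shrinking: a pinned MTU survives only below the new interface MTU. */
	if (fixed && rt_mtu < new_ifmtu)
		return (rt_mtu);
	return (new_ifmtu);
}
```

For example, a route pinned at 1400 keeps 1400 in both directions, while a route pinned at 4000 is cut to 1500 when the interface drops from 9000 to 1500.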

PR: 194238
MFC after: 1 month
CR: D1125


# 206344ac 16-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove unused rt_endzero define. Remove rt_mtx from public rtentry version.


# 69d149ad 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Since we no longer return individual radix entries, it is
not possible to do per-rte accounting. Remove rt_kpktsent.


# 033074c4 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Replace 'struct route *' if_output() argument with 'struct nhop_info *'.
Leave 'struct route' as is for legacy routing api users.
Remove most of rtalloc_ign*-derived functions.


# 55e5eda6 08-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Separate radix and routing: use different structures for route and
for other customers.

Introduce new 'struct rib_head' for routing purposes and make
all routing api use it.


# 1398ffe5 08-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert most of "for (fibnum = 0; fibnum < rt_numfibs; fibnum++)" users
to use new rt_foreach_fib() instead of hand-rolling cycles.


# 8c3cfe0b 04-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Hide 'struct rtentry' and all its macros inside new header:
net/route_internal.h
The goal is to make it opaque for all code except route/rtsock and
proto domain _rmx.


# b4e8f808 19-Oct-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Switch IPv4 output path to use new routing api.

The goal of the new API is to provide consumers with minimal
needed information, but as fast as possible. So we provide
full nexthop info copied into aligned on-cache structure
instead of rte/ia pointers, their refcounts and locks.
This does not provide solution for protecting from egress
ifp destruction, but does not make it any worse.

Current changes:

nhops:
Add fib4_lookup_prepend() function which stores either full
L2+L3 prepend info (e.g. MAC header in case of plain IPv4) or
L3 info with NH_FLAGS_L2_INCOMPLETE flag indicating that no valid L2
info exists and we have to take "slow" path.

ip_output:
Currently ip[ 46]_output consumers use 'struct route' for
the following purposes:
1) double lookup avoidance(route caching)
2) plain route caching
3) get path MTU to be able to notify source.
The former pattern is mostly used by various tunnels
(gif, gre, stf). (Actually, gre is the only remaining,
others were already converted. Their locking model did
not scale well enough to benefit from such caching, so
we have (temporarily) removed it without any performance
loss).
Plain route caching used by SCTP is simply wrong and should be removed.
Temporarily break it for now just to be able to compile.
Optimize path mtu reporting by providing it in new 'route_info' structure.

Minimize games with @ia locking/refcounting for route lookup:
add special nhop[46]_extended structure to store more route attributes.
Pointer to given structure can be passed to fib4_lookup_prepend() to indicate
we want this info (we actually need it for UDP and raw IP).

ether_output:
Provide light-weight ether_output2() call to deal with
transmitting L2 frame (e.g. properly handle broadcast/simloop/bridge/
other L2 hooks before actually transmitting frame by if_transmit()).
Add a hack based on new RT_NHOP ro_flag to distinguish which version should
we call. Better way is probably to add a new "if_output_frame" driver
callbacks.

Next steps:
* Convert ip_fastfwd part
* Implement auto-growing array for per-radix nexthops
* Implement LLE tracking for nexthop calculations to be able to
immediately provide all necessary info in single route lookup
for gateway routes
* Switch radix locking scheme to runtime/cfg lock
* Implement multipath support for rtsock
* Implement "tracked nexthops" for tunnels (e.g. _proper_
nexthop caching)
* Add IPv6 support for remaining parts (postponed not to
interfere with user/ae/inet6 branch)
* Consider adding "if_output_frame" driver call to
ease logical frame pushing.


# 9f21b0b8 21-Sep-2014 Hiroki Sato <hrs@FreeBSD.org>

Fix build.


# ee0bd4b9 20-Sep-2014 Hiroki Sato <hrs@FreeBSD.org>

Make net.add_addr_allfibs vnet-local.


# b980262e 03-May-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Pass radix head ptr along with rte to rtexpunge().
Rename rtexpunge to rt_expunge().


# 0fb9298d 29-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Move rt_setmetrics() from rtsock.c to route.c.
All rtsock-initiated rte creation/modification are now
performed in route.c holding radix tree write lock.
This reduces the need for per-rte mutex.

Sponsored by: Yandex LLC
MFC after: 1 month


# 36d55f0f 26-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Unify sa_equal() macro usage.

MFC after: 2 weeks


# de51fbd6 19-Apr-2014 John-Mark Gurney <jmg@FreeBSD.org>

garbage collect something that hasn't been triggered in almost 5 years...
the last consumer was removed a couple years ago...


# 66dcee72 15-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Garbage collect long time obsoleted (or never used) stuff from routing API.


# 256ea2ab 05-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

The route code used to mtx_destroy() a locked mutex before rtentry free. Now,
after r262763 it started to return locked mutexes to UMA. To fix that,
conditionally unlock the mutex in the destructor.

Tested by: "Sergey V. Dyatko" <sergey.dyatko@gmail.com>


# 5274e55e 04-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Hide struct rtentry from userland.


# e3a7aa6f 04-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
removes another cache thrashing ++ from packet forwarding path.
- Create zinit/fini methods for the rtentry UMA zone. Initialize
mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with: melifaro
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# d375edc9 09-Jan-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify inet alias handling code: if we're adding/removing alias which
has the same prefix as some other alias on the same interface, use
newly-added rt_addrmsg() instead of hand-rolled in_addralias_rtmsg().

This eliminates the following rtsock messages:

Pinned RTM_ADD for prefix (for alias addition).
Pinned RTM_DELETE for prefix (for alias withdrawal).

Example (got 10.0.0.1/24 on vlan4, playing with 10.0.0.2/24):

before commit, addition:

got message of size 116 on Fri Jan 10 14:13:15 2014
RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

got message of size 192 on Fri Jan 10 14:13:15 2014
RTM_ADD: Add Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 10.0.0.2 (255) ffff ffff ff

after commit, addition:

got message of size 116 on Fri Jan 10 13:56:26 2014
RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 14.0.0.2 14.0.0.255

before commit, withdrawal:

got message of size 192 on Fri Jan 10 13:58:59 2014
RTM_DELETE: Delete Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 10.0.0.2 (255) ffff ffff ff

got message of size 116 on Fri Jan 10 13:58:59 2014
RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

after commit, withdrawal:

got message of size 116 on Fri Jan 10 14:14:11 2014
RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

Sending both RTM_ADD/RTM_DELETE messages to rtsock is completely wrong
(and requires some hacks to keep prefix in route table on RTM_DELETE).

I've tested this change with quagga (no change) and bird (*).

bird alias handling is already broken in *BSD sysdep code, so nothing
changes here, too.

I'm going to MFC this change if there are no complaints about the behavior
change.

While here, fix some style(9) bugs introduced by r260488
(pointed by glebius and bde).

Sponsored by: Yandex LLC
MFC after: 4 weeks


# 4cbac30b 09-Jan-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Split rt_newaddrmsg_fib() into two different functions.
Adding/deleting interface addresses involves access to 3 different subsystems,
in different parts of the code. Each call can fail, so reporting a successful
operation by rtsock in the middle of the process is error-prone.

Further split routing notification API and actual rtsock calls via creating
publicly available rt_addrmsg() / rt_routemsg() functions with "private"
rtsock_* backend.

MFC after: 2 weeks


# 7d9b6df1 08-Jan-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Consistently use RT_ALL_FIBS everywhere instead of -1.

MFC after: 2 weeks


# f672f56f 24-Jun-2013 Qing Li <qingli@FreeBSD.org>

Due to the routing related networking kernel redesign work
in FBSD 8.0, interface routes have been returned to the
applications without the RTF_GATEWAY bit. This incompatibility
has caused some issues with Zebra, Quagga and the like.
This patch provides the RTF_GATEWAY flag bit in returned interface
routes so to behave similarly to pre 8.0 systems.

Reviewed by: hrs
Verified by: mackn at opendns dot com


# 3034f43f 08-Mar-2013 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix long-standing issue with interface routes being unprotected:
Use RTF_PINNED flag to mark route as immutable.
Forbid deleting immutable routes without special rtrequest1_fib() flag.
Adding interface address with prefix already in route table is handled
by atomically deleting old prefix and adding interface one.

Discussed with: andre, eri
MFC after: 3 weeks


# bf984051 04-Jul-2012 Gleb Smirnoff <glebius@FreeBSD.org>

When ip_output()/ip6_output() is supplied a struct route *ro argument,
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.

The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. The reference is held by the flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.

- Introduce new flag for struct route/route_in6, that marks route
not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
but empty route should use RO_RTFREE() to free results of
lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
ro->ro_rt == NULL.
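
The cleanup rule described above can be sketched roughly as follows. This is
only an illustration under stated assumptions: the names sk_route, sk_rtentry,
SK_RT_NORTREF and sk_ro_rtfree() are hypothetical stand-ins, not the kernel's
definitions, and clearing ro_rt at the end is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag: this route does not hold a reference on its rtentry. */
#define SK_RT_NORTREF 0x1

/* Stand-ins for the kernel structures, reduced to what the sketch needs. */
struct sk_rtentry {
        int refcnt;
};

struct sk_route {
        struct sk_rtentry *ro_rt;
        int ro_flags;
};

/*
 * Clean up a route depending on its kind: drop the reference only when
 * the route actually holds one, then forget the cached entry.
 */
static void
sk_ro_rtfree(struct sk_route *ro)
{
        if (ro->ro_rt == NULL)
                return;
        if ((ro->ro_flags & SK_RT_NORTREF) == 0) {
                assert(ro->ro_rt->refcnt > 0);
                ro->ro_rt->refcnt--;
        }
        ro->ro_rt = NULL;
}
```

A caller that passed an empty route to the output routine would call the
cleanup helper once, without needing to know which lookup path filled it.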

Tested by: tuexen (SCTP part)


# bfca216e 18-Mar-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Hide kernel option ROUTETABLES evaluations in the implementation
rather than the header file. With this also move RT_MAXFIBS and
RT_NUMFIBS into the implementation to avoid further usage in other
code. rt_numfibs is all that should be needed.

This allows users to change the number of FIBs from 1..RT_MAXFIBS(16)
dynamically using the tunable without the need to change the kernel
config for the maximum anymore. This means that the multi-FIB
feature is now fully available with GENERIC kernels.
The kernel option ROUTETABLES can still be used to set the default
numbers of FIBs in absence of the tunable.

Ok.ed by: julian, hrs, melifaro
MFC after: 2 weeks


# a8498625 02-Feb-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Move a comment from rtinit1() to the top of the file where dealing with
the (maximum) number of FIBs trying to clarify that eventually FIBs
should probably be attached to domain(9) specific storage. [1]

Add a comment on a limitation on the rt_add_addr_allfibs option.

Use RT_DEFAULT_FIB instead of 0 where applicable.

Add empty line to functions without local variables per style.

Put public yet unused in-tree function rtinit_fib() under BURN_BRIDGES
to indicate that it might go away in the future.

No functional change.

Discussed with: julian [1] (clarification on what the original one meant)
Sponsored by: Cisco Systems, Inc.


# 556d81dd 03-Feb-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than putting magic 0s as FIB argument into the rt* calls, provide
a macro RT_DEFAULT_FIB defined to 0 to more easily identify the cases
tied to the default FIB.

Sponsored by: Cisco Systems, Inc.


# 528737fd 28-Sep-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Pass the fibnum where we need filtering of the message on the
rtsock allowing routing daemons to filter routing updates on an
rtsock per FIB.

Adjust raw_input() and split it into wrapper and a new function
taking an optional callback argument even though we only have one
consumer [1] to keep the hackish flags local to rtsock.c.

PR: kern/134931
Submitted by: multiple (see PR)
Suggested by: rwatson [1]
Reviewed by: rwatson
MFC after: 3 days


# 1eeb6d97 20-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

Make KBI changes required for future MFCing of inpcb rtentry / llentry caching.

Reviewed by: rwatson, bz
Approved by: re (kib)


# f5857e2d 21-Jun-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Garbage collect never used global, sysctl, externs.

MFC after: 1 week


# 2093339e 20-Mar-2011 Dmitry Chagin <dchagin@FreeBSD.org>

Remove dead code.

MFC after: 1 Week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 94190b39 01-Apr-2010 Qing Li <qingli@FreeBSD.org>

MFC 205222

Verify interface up status using its link state only
if the interface has such capability. The interface
capability flag indicates whether such capability
exists. This approach is much more backward compatible.
Physical device driver changes will be part of another
commit.

Also updated the ifconfig utility to show the LINKSTATE
capability if present.

Reviewed by: rwatson, imp, juli


# 243785f9 01-Apr-2010 Qing Li <qingli@FreeBSD.org>

MFC 205024

The if_tap interface is of IFT_ETHERNET type, but it
does not set or update the if_link_state variable.
As such RT_LINK_IS_UP() fails for the if_tap interface.

Also, the RT_LINK_IS_UP() needs to bypass all loopback
interfaces because loopback interfaces are considered
up logically as long as the system is running.

This patch fixes the above issues by setting and updating
the if_link_state variable when the tap interface is
opened or closed respectively. A similar approach is
already used in the if_tun device.


# c951da56 01-Apr-2010 Qing Li <qingli@FreeBSD.org>

MFC 204902

One of the advantages of enabling ECMP (a.k.a RADIX_MPATH) is to
allow for connection load balancing across interfaces. Currently
the address alias handling method is colliding with the ECMP code.
For example, when two interfaces are configured on the same prefix,
only one prefix route is installed. So connection load balancing
among the available interfaces is not possible.

The other advantage of ECMP is for failover. The issue with the
current code is that the interface link-state is not reflected
in the route entry. For example, if there are two interfaces on
the same prefix, the cable on one interface is unplugged, new and
existing connections should switch over to the other interface.
This is not done today and packets go into a black hole.

Also, there is a small bug in the kernel where deleting ECMP routes
in the userland will always return an error even though the command
is successfully executed.


# 6b533b5d 16-Mar-2010 Qing Li <qingli@FreeBSD.org>

Verify interface up status using its link state only
if the interface has such capability. The interface
capability flag indicates whether such capability
exists. This approach is much more backward compatible.
Physical device driver changes will be part of another
commit.

Also updated the ifconfig utility to show the LINKSTATE
capability if present.
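
The check described above can be sketched as below. This is a minimal
illustration, not the kernel macro: sk_ifnet, SK_CAP_LINKSTATE and
sk_link_is_up() are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability bit: the driver reports link state. */
#define SK_CAP_LINKSTATE 0x1

enum sk_link_state { SK_LINK_UNKNOWN, SK_LINK_DOWN, SK_LINK_UP };

/* Stand-in for struct ifnet, reduced to the two fields the check reads. */
struct sk_ifnet {
        unsigned int capabilities;
        enum sk_link_state link_state;
};

/*
 * Consult the link state only if the interface advertises the
 * capability; otherwise treat the link as up, which preserves the
 * behaviour of drivers that never update their link state.
 */
static bool
sk_link_is_up(const struct sk_ifnet *ifp)
{
        assert(ifp != NULL);
        if ((ifp->capabilities & SK_CAP_LINKSTATE) == 0)
                return (true);
        return (ifp->link_state == SK_LINK_UP);
}
```

Under this rule a driver without the capability is never reported as down,
so only drivers that opt in can cause a route to be treated as unusable.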

Reviewed by: rwatson, imp, juli
MFC after: 3 days


# 355ad3ea 11-Mar-2010 Qing Li <qingli@FreeBSD.org>

The if_tap interface is of IFT_ETHERNET type, but it
does not set or update the if_link_state variable.
As such RT_LINK_IS_UP() fails for the if_tap interface.

Also, the RT_LINK_IS_UP() needs to bypass all loopback
interfaces because loopback interfaces are considered
up logically as long as the system is running.

This patch fixes the above issues by setting and updating
the if_link_state variable when the tap interface is
opened or closed respectively. A similar approach is
already used in the if_tun device.

MFC after: 3 days


# c7ea0aa6 08-Mar-2010 Qing Li <qingli@FreeBSD.org>

One of the advantages of enabling ECMP (a.k.a RADIX_MPATH) is to
allow for connection load balancing across interfaces. Currently
the address alias handling method is colliding with the ECMP code.
For example, when two interfaces are configured on the same prefix,
only one prefix route is installed. So connection load balancing
among the available interfaces is not possible.

The other advantage of ECMP is for failover. The issue with the
current code is that the interface link-state is not reflected
in the route entry. For example, if there are two interfaces on
the same prefix, the cable on one interface is unplugged, new and
existing connections should switch over to the other interface.
This is not done today and packets go into a black hole.

Also, there is a small bug in the kernel where deleting ECMP routes
in the userland will always return an error even though the command
is successfully executed.

MFC after: 5 days


# 32c53401 05-Jan-2010 Qing Li <qingli@FreeBSD.org>

MFC r201282, r201543

r201282
-------
The proxy arp entries could not be added into the system over the
IFF_POINTOPOINT link types. The reason was due to the routing
entry returned from the kernel covering the remote end is of an
interface type that does not support ARP. This patch fixes this
problem by providing a hint to the kernel routing code, which
indicates the prefix route instead of the PPP host route should
be returned to the caller. Since a host route to the local end
point is also added into the routing table, and there could be
multiple such instantiations because multiple PPP links can be
created with the same local end IP address, this patch also fixes
the loopback route installation failure problem observed prior to
this patch. The reference count of loopback route to local end would
be either incremented or decremented. The first instantiation would
create the entry and the last removal would delete the route entry.

r201543
-------
The IFA_RTSELF address flag marks that a loopback route has been installed
for the interface address. This marker is necessary to properly support
PPP types of links where multiple links can have the same local end
IP address. The IFA_RTSELF flag bit maps to the RTF_HOST value, which
was combined into the route flag bits during prefix installation in
IPv6. This inclusion caused the prefix route to be unusable. This
patch fixes this bug by excluding the IFA_RTSELF flag during route
installation.

PR: ports/141342, kern/141134


# c7ab6602 30-Dec-2009 Qing Li <qingli@FreeBSD.org>

The proxy arp entries could not be added into the system over the
IFF_POINTOPOINT link types. The reason was due to the routing
entry returned from the kernel covering the remote end is of an
interface type that does not support ARP. This patch fixes this
problem by providing a hint to the kernel routing code, which
indicates the prefix route instead of the PPP host route should
be returned to the caller. Since a host route to the local end
point is also added into the routing table, and there could be
multiple such instantiations because multiple PPP links can be
created with the same local end IP address, this patch also fixes
the loopback route installation failure problem observed prior to
this patch. The reference count of loopback route to local end would
be either incremented or decremented. The first instantiation would
create the entry and the last removal would delete the route entry.

MFC after: 5 days


# 9a311445 08-Sep-2009 Navdeep Parhar <np@FreeBSD.org>

Add arp_update_event. This replaces route_arp_update_event, which
has not worked since the arp-v2 rewrite.

The event handler will be called with the llentry write-locked and
can examine la_flags to determine whether the entry is being added
or removed.

Reviewed by: gnn, kmacy
Approved by: gnn (mentor)
MFC after: 1 month


# f987f193 22-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Collect all VIMAGE_GLOBALS variables in one place.

No longer export rt_tables as all lookups go through
rt_tables_get_rnh().

We cannot make rt_tables (and rtstat, rttrash[1]) static as
netstat -r (-rs[1]) would stop working on a stripped
VIMAGE_GLOBALS kernel.

Reviewed by: zec
Presumably broken by: phk 13.5y ago in r12820 [1]


# c2c2a7c1 01-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Convert the two dimensional array to be malloced and introduce
an accessor function to get the correct rnh pointer back.

Update netstat to get the correct pointer using kvm_read()
as well.

This not only fixes the ABI problem depending on the kernel
option but also permits the tunable to overwrite the kernel
option at boot time up to MAXFIBS, enlarging the number of
FIBs without having to recompile. So people could just use
GENERIC now.

Reviewed by: julian, rwatson, zec
X-MFC: not possible


# 279aa3d4 16-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

Change if_output to take a struct route as its fourth argument in order
to allow passing a cached struct llentry * down to L2

Reviewed by: rwatson


# 8f22a721 15-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

revert RTM_VERSION change - it doesn't do what I thought it does and changing breaks
ifconfig needlessly


# de4ab55e 15-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

add an llentry to struct route{_in6} to allow it to be passed around with
the rtentry


# 427ac07f 14-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

Extend route command:
- add show as alias for get
- add weights to allow mpath to do more than equal cost
- add sticky / nostick to disable / re-enable per-connection load balancing

This adds a field to rt_metrics_lite so network bits of world will need to be re-built.

Reviewed by: jeli & qingli


# 14981d80 12-Jan-2009 Qing Li <qingli@FreeBSD.org>

Revive the RTF_LLINFO flag in route.h. The kernel code is guarded
by the new kernel option COMPAT_ROUTE_FLAGS for binary backward
compatibility. The RTF_LLDATA flag maps to the same value as RTF_LLINFO.
RTF_LLDATA is used by the arp and ndp utilities. The RTF_LLDATA flag is
always returned to the userland regardless whether the COMPAT_ROUTE_FLAGS
is defined.


# 8eca593c 26-Dec-2008 Qing Li <qingli@FreeBSD.org>

This checkin addresses a couple of issues:
1. The "route" command allows route insertion through the interface-direct
option "-iface". During if_attach(), an sockaddr_dl{} entry is created
for the interface and is part of the interface address list. This
sockaddr_dl{} entry describes the interface in detail. The "route"
command selects this entry as the "gateway" object when the "-iface"
option is present. The "arp" and "ndp" commands also interact with the
kernel through the routing socket when adding and removing static L2
entries. The static L2 information is also provided through the
"gateway" object with an AF_LINK family type, similar to what is
provided by the "route" command. In order to differentiate between
these two types of operations, a RTF_LLDATA flag is introduced. This
flag is set by the "arp" and "ndp" commands when issuing the add and
delete commands. This flag is also set in each L2 entry returned by the
kernel. The "arp" and "ndp" commands follow a convention where a RTM_GET
is issued first followed by a RTM_ADD/DELETE. This RTM_GET request fills
in the fields for a "rtm" object, which is reinjected into the kernel by
a subsequent RTM_ADD/DELETE command. The entry returned from RTM_GET
is a prefix route, so the RTF_LLDATA flag must be specified when issuing
the RTM_ADD/DELETE messages.

2. Enforce the convention that NET_RT_FLAGS with a 0 w_arg is the
specification for retrieving L2 information. Also optimized the
code logic.

Reviewed by: julian


# 6e6b3f7c 14-Dec-2008 Qing Li <qingli@FreeBSD.org>

The main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as many locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplifying the logic in the routing code

The most notable end result is the obsolescence of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improve/refactor the code, and
provided valuable reviews
- Julian Elischer set up the perforce tree for me and has helped
me maintain that branch before the svn conversion


# 3120b9d4 07-Dec-2008 Kip Macy <kmacy@FreeBSD.org>

- convert radix node head lock from mutex to rwlock
- make radix node head lock not recursive
- fix LOR in rtexpunge
- fix LOR in rtredirect

Reviewed by: sam


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 4e7840e2 20-Sep-2008 Marko Zec <zec@FreeBSD.org>

Move #defines for MRT-related constants from net/route.c to
net/route.h, because the vnet code will need those constants as
well.

Reviewed by: bz
Approved by: julian (mentor)
MFC after: never


# 5e7b481a 14-Sep-2008 Julian Elischer <julian@FreeBSD.org>

come on Julian, make up if you're committing one change or the other.
fix braino


# 93fcb5a2 14-Sep-2008 Julian Elischer <julian@FreeBSD.org>

Revert a part of the MRT commit that proved un-needed.
rt_check() in its original form proved to be sufficient and
rt_check_fib() can go away (as can its evil twin in_rt_check()).

I believe this does NOT address the crashes people have been seeing
in rt_check.

MFC after: 1 week


# abeae30c 05-Sep-2008 Julian Elischer <julian@FreeBSD.org>

Be consistent about whether these multi-lined macros are separated by
a blank line. Some were, some weren't. Decide in favour of the line
as it matches what an inline would do, and it's easier to read.


# 6f95a5eb 09-May-2008 Julian Elischer <julian@FreeBSD.org>

move a #define from a place it shouldn't have been to a place it should
have been. Basically my testing didn't cover one case that this broke.
thanks tinderbox!


# 8b07e49a 09-May-2008 Julian Elischer <julian@FreeBSD.org>

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different packet streams to be routed by more than just the destination
address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as an EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in the timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.

Implementation method, Compatible version. (part 1)
-------------------------------

For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extend that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X], where for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
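
The two dimensional lookup described above can be sketched as follows. All
names and sizes here (sk_tables, sk_tables_get(), SK_NUM_FIBS, SK_AF_MAX,
SK_AF_INET, struct sk_rnh) are hypothetical and chosen for illustration; they
are not the identifiers or limits used in the tree.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NUM_FIBS 4   /* hypothetical number of FIBs (rows) */
#define SK_AF_MAX   8   /* hypothetical number of protocol families */
#define SK_AF_INET  2   /* stand-in index for the IPv4 family */

/* Stand-in for a per-family routing table head. */
struct sk_rnh {
        int id;
};

/* Row = FIB number, column = protocol family. */
static struct sk_rnh *sk_tables[SK_NUM_FIBS][SK_AF_MAX];

/*
 * Return the table head for (fib, family). Every family except IPv4 is
 * forced to row 0, so code that is unaware of FIBs sees what looks like
 * the old one dimensional array.
 */
static struct sk_rnh *
sk_tables_get(int fib, int family)
{
        assert(fib >= 0 && fib < SK_NUM_FIBS);
        assert(family >= 0 && family < SK_AF_MAX);
        if (family != SK_AF_INET)
                fib = 0;
        return (sk_tables[fib][family]);
}
```

Forcing row 0 inside the accessor keeps the policy in one place, so callers
for other protocol families do not need to change.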

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility called setfib
that acts a bit like nice.

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being responded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect with the process that set up the tunnel.
Thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routed
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from any to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibility of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the destination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# e440aed9 12-Apr-2008 Qing Li <qingli@FreeBSD.org>

This patch provides the back end support for equal-cost multi-path
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
is disallowed. For example,

route add -net 192.103.54.0/24 10.9.44.1
route add -net 192.103.54.0/24 10.9.44.2

The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"

Multiple default routes can also be inserted. Here is the netstat
output:

default 10.2.5.1 UGS 0 3074 bge0 =>
default 10.2.5.2 UGS 0 0 bge0

When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,

route delete default

would fail and trigger the following error message:

"route: writing to routing socket: No such process"
"delete net default: not in table"

On the other hand,

route delete default 10.2.5.2

would be successful: "delete net default: gateway 10.2.5.2"

One does not have to specify a gateway if there is only a single
route for a particular destination.

I need to perform more testing on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options RADIX_MPATH" in the kernel configuration
to enable this feature.

Reviewed by: robert, sam, gnn, julian, kmacy


# f321ff15 27-Dec-2007 Maxime Henrion <mux@FreeBSD.org>

Add a workaround for a deadlock between the rt_setgate() and rt_check()
functions. It is easily triggered by running routed, and, I expect, by
running any other daemon that uses routing sockets.

Reviewed by: net@
MFC after: 1 week


# 29910a5a 17-Dec-2007 Kip Macy <kmacy@FreeBSD.org>

widen the routing event interface (arp update, redirect, and eventually pmtu change)
into separate functions

revert previous commit's changes to arpresolve and add a new interface
arpresolve2 which does arp resolution without an mbuf


# 8e7e854c 12-Dec-2007 Kip Macy <kmacy@FreeBSD.org>

add interface for allowing consumers to register for ARP updates,
redirects, and path MTU changes

Reviewed by: silby


# 22cafcf0 15-Mar-2006 Andre Oppermann <andre@FreeBSD.org>

- Fill in the correct rtm_index for RTM_ADD and RTM_CHANGE messages.

- Allow RTM_CHANGE to change a number of route flags as specified by
RTF_FMASK.

- The unused rtm_use field in struct rt_msghdr is redesignated as
rtm_fmask field to communicate route flag changes in RTM_CHANGE
messages from userland. The use count of a route was moved to
rtm_rmx a long time ago. For source code compatibility reasons
a define of rtm_use to rtm_fmask is provided.

These changes facilitate running of multiple cooperating routing
daemons at the same time without causing undesired interference.
Open[BGP|OSPF]D make use of these features to have IGP routes
override EGP ones.

Obtained from: OpenBSD (claudio@)
MFC after: 3 days


# 17a8471f 14-Sep-2005 Andre Oppermann <andre@FreeBSD.org>

Remove bogus semicolons at the end of the definitions of
'do { ... } while (0)' macros.
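
Why the trailing semicolon matters can be shown with a small sketch. The
macro and types below (COUNTER_ADDREF, struct counter, bump_if()) are
hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Hypothetical reference-counted object, for illustration only. */
struct counter {
        int refs;
};

/* Correct form: no semicolon after while (0); the caller supplies it. */
#define COUNTER_ADDREF(c) do {          \
        assert((c)->refs >= 0);         \
        (c)->refs++;                    \
} while (0)

/*
 * If the definition ended in ';', then "COUNTER_ADDREF(c);" would expand
 * to two statements, and the "else" below would no longer pair with the
 * "if", which is a compile error.
 */
static int
bump_if(struct counter *c, int cond)
{
        if (cond)
                COUNTER_ADDREF(c);
        else
                return (0);
        return (c->refs);
}
```

With the semicolon left to the caller, the macro behaves like a single
statement in every context where a function call would.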

PR: kern/83088
Submitted by: <antoine.brodin at laposte.net>


# c398230b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# b83a279f 05-Oct-2004 Sam Leffler <sam@FreeBSD.org>

Add 802.11-specific events that are dispatched through the routing socket.
This really doesn't belong here but is preferred (for the moment) over
adding yet another mechanism for sending msgs from the kernel to user apps.

Reviewed by: imp


# 445e045b 28-Jul-2004 Alexander Kabaev <kan@FreeBSD.org>

Avoid casts as lvalues.


# 3916ebe8 24-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

document the locking behaviour of the functions that access
the routing table.


# f76d5670 20-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

Document an assumption on the structure of 'struct rtentry'


# 2eb5613f 17-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

make route_init() static


# e74642df 13-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

route.h: introduce a macro, SA_SIZE(struct sockaddr *) which returns
the space occupied by a struct sockaddr when passed through a
routing socket.
Use it to replace the macro ROUNDUP(int), that does the same but
is redefined by every file which uses it, courtesy of
the School of Cut'n'Paste Programming(TM).

(partial) userland changes to follow.
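
The rounding such a macro performs can be sketched as a function. This
assumes the usual routing socket convention that each sockaddr is padded to
a multiple of sizeof(long) and that an empty one still occupies one long;
sa_size_sketch() is a hypothetical name, not the macro from route.h.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Space taken by a sockaddr of length sa_len in a routing message,
 * under the assumptions stated in the lead-in.
 */
static size_t
sa_size_sketch(size_t sa_len)
{
        const size_t align = sizeof(long);

        /* The bit trick below needs a power-of-two alignment. */
        assert((align & (align - 1)) == 0);
        if (sa_len == 0)
                return (align);
        /* Round sa_len up to the next multiple of align. */
        return (1 + ((sa_len - 1) | (align - 1)));
}
```

Keeping the rule in one shared definition avoids each consumer carrying its
own slightly different copy of the round-up logic.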


# f36cfd49 07-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# f7c5baa1 03-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

+ arpresolve(): remove an unused argument
+ struct ifnet: remove unused fields, move ipv6-related field close
to each other, add a pointer to l3<->l2 translation tables (arp,nd6,
etc.) for future use.

+ struct route: remove an unused field, move close to each
other some fields that might likely go away in the future


# 97d8d152 20-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# 26d02ca7 20-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Remove RTF_PRCLONING from routing table and adjust users of it
accordingly. The define is left intact for ABI compatibility
with userland.

This is a pre-step for the introduction of tcp_hostcache. The
network stack remains fully useable with this change.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# 7138d65c 08-Nov-2003 Sam Leffler <sam@FreeBSD.org>

replace explicit changes to rt_refcnt by RT_ADDREF and RT_REMREF
macros that expand to include assertions when the system is built
with INVARIANTS

Supported by: FreeBSD Foundation
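
A rough sketch of the pattern this commit describes: reference-count changes go through macros that also assert sanity when a debug option is enabled. Every name carrying a SKETCH or sketch suffix is hypothetical, and the conditions checked are assumptions rather than the actual INVARIANTS assertions.

```c
#include <assert.h>

/* Hypothetical stand-in for the one rtentry field the macros touch. */
struct rtentry_sketch {
	long rt_refcnt;
};

/* Assumption: assertions are compiled in only for debug builds. */
#ifdef INVARIANTS_SKETCH
#define RT_ASSERT_SKETCH(cond)	assert(cond)
#else
#define RT_ASSERT_SKETCH(cond)	((void)0)
#endif

/* Take a reference; the count must not already be negative. */
#define RT_ADDREF_SKETCH(rt) do {				\
	RT_ASSERT_SKETCH((rt)->rt_refcnt >= 0);			\
	(rt)->rt_refcnt++;					\
} while (0)

/* Drop a reference; there must be one to drop. */
#define RT_REMREF_SKETCH(rt) do {				\
	RT_ASSERT_SKETCH((rt)->rt_refcnt > 0);			\
	(rt)->rt_refcnt--;					\
} while (0)
```

Routing every update through the macros gives one place to catch an underflow, which open-coded increments and decrements cannot offer.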


# 9c63e9db 30-Oct-2003 Sam Leffler <sam@FreeBSD.org>

Overhaul routing table entry cleanup by introducing a new rtexpunge
routine that takes a locked routing table reference and removes all
references to the entry in the various data structures. This
eliminates instances of recursive locking and also closes races
where the lock on the entry had to be dropped prior to calling
rtrequest(RTM_DELETE). This also cleans up confusion where the
caller held a reference to an entry that might have been reclaimed
(and in some cases used that reference).

Supported by: FreeBSD Foundation


# d1dd20be 03-Oct-2003 Sam Leffler <sam@FreeBSD.org>

Locking for updates to routing table entries. Each rtentry gets a mutex
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.

Other/related changes:

o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts

Notes:

1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do not know about mutexes. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.

Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)


# becc44d7 03-Oct-2003 Sam Leffler <sam@FreeBSD.org>

cleanups prior to adding locking (and in some cases to eliminate locking):

o move route_cb to be private to rtsock.c
o replace global static route_proto by locals
o eliminate global #define shorthands for info references
o remove some register decls
o ansi-fy function decls
o move items to be close in scope to their usage
o add rt_dispatch function for dispatching the actual message
o cleanup tangled logic for doing all-but-me msg send

Supported by: FreeBSD Foundation


# 1ebe9986 18-Jul-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Add mutex for routing entries.

Reviewed by: bmilekic, silby


# 3c6b084e 05-Mar-2003 Peter Wemm <peter@FreeBSD.org>

Finish driving a stake through the heart of netns and the associated
ifdefs scattered around the place - it's dead, Jim!

The SMB stuff had stolen AF_NS, make it official.


# 7f760c48 02-Mar-2003 Matthew N. Dodd <mdodd@FreeBSD.org>

Reduce code duplication. This adds the function rt_check() to route.c.

Approved by: sam (in principle)


# 34fe62c7 24-Mar-2002 Bruce Evans <bde@FreeBSD.org>

Fixed some style bugs in the removal of __P(()). The main ones were
not removing tabs before "__P((", and not outdenting continuation lines
to preserve non-KNF lining up of code with parentheses. Switch to KNF
formatting and/or rewrap the whole prototype in some cases.


# 929ddbbb 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# 7b6edd04 18-Jan-2002 Ruslan Ermilov <ru@FreeBSD.org>

Introduce an interface announcement message for the routing
socket so that routing daemons and other interested parties
know when an interface is attached/detached.

PR: kern/33747
Obtained from: NetBSD
MFC after: 2 weeks


# be2ac88c 21-Nov-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce a syncache, which enables FreeBSD to withstand a SYN flood
DoS in an improved fashion over the existing code.

Reviewed by: silby (in a previous iteration)
Sponsored by: DARPA, NAI Labs


# 8071913d 17-Oct-2001 Ruslan Ermilov <ru@FreeBSD.org>

Pull post-4.4BSD change to sys/net/route.c from BSD/OS 4.2.

Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *''
as the argument. Pass rt_addrinfo all the way down to rtrequest1
and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now
``rt_addrinfo *'' instead of ``sockaddr *'' (almost no one is
using it anyways).

Benefit: the following command now works. Previously we needed
two route(8) invocations, "add" then "change".
# route add -inet6 default ::1 -ifp gif0

Remove unsafe typecast in rtrequest(), from ``rtentry *'' to
``sockaddr *''. It was introduced by 4.3BSD-Reno and never
corrected.

Obtained from: BSD/OS, NetBSD
MFC after: 1 month
PR: kern/28360


# 4862bf8c 17-Oct-2001 Ruslan Ermilov <ru@FreeBSD.org>

64-bit fixes from CSRG.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# c3cb7e5d 25-Jul-2001 Bill Fenner <fenner@FreeBSD.org>

Don't bother passing p to rtioctl just so it can fail to pass it to mrt_ioctl


# e7f32693 21-Jul-2000 Jayanth Vijayaraghavan <jayanth@FreeBSD.org>

When a connection is being dropped due to a listen queue overflow,
delete the cloned route that is associated with the connection.
This does not exhaust the routing table memory when the system
is under a SYN flood attack. The route entry is not deleted if there
is any prior information cached in it.

Reviewed by: Peter Wemm,asmodai


# 242c5536 12-Feb-2000 Peter Wemm <peter@FreeBSD.org>

Clean up some loose ends in the network code, including the X.25 and ISO
#ifdefs. Clean out unused netisr's and leftover netisr linker set gunk.
Tested on x86 and alpha, including world.

Approved by: jkh


# 664a31e4 28-Dec-1999 Peter Wemm <peter@FreeBSD.org>

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistent with the other
BSDs who made this change quite some time ago. More commits to come.


# ae5bcbff 09-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

rtcalloc() is removed because it turned out not to be necessary for FreeBSD.
(It was added as a part of KAME patch)

Specified by: jdp@polstra.com


# 76429de4 05-Nov-1999 Yoshinobu Inoue <shin@FreeBSD.org>

KAME related header files additions and merges.
(only those which don't affect c source files so much)

Reviewed by: cvs-committers
Obtained from: KAME project


# 97998e86 13-Sep-1999 Ruslan Ermilov <ru@FreeBSD.org>

Add comments, fix typos.

Reviewed by: wollman


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 64e41ba7 30-Jun-1999 Mike Smith <msmith@FreeBSD.org>

Increase the size of the route reference count from 15 bits to 31 bits.

This doesn't change the size or alignment of the structure on either i386
or Alpha, and thus should be binary-compatible (modulo problems with old
applications and routes with more than 2^15 references).

Reviewed by: peter


# dfd5dee1 06-May-1999 Peter Wemm <peter@FreeBSD.org>

Add sufficient braces to keep egcs happy about potentially ambiguous
if/else nesting.


# bd7ac1b2 23-Mar-1998 Bruce Evans <bde@FreeBSD.org>

Added a forward struct declaration so that this file is less
self-insufficient.


# bea0f0be 06-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Some staticized variables were still declared to be extern.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 477180fb 13-Jan-1997 Garrett Wollman <wollman@FreeBSD.org>

Use the new if_multiaddrs list for multicast addresses rather than the
previous hackery involving struct in_ifaddr and arpcom. Get rid of the
abominable multi_kludge. Update all network interfaces to use the
new mechanism. Distressingly few Ethernet drivers program the multicast
filter properly (assuming the hardware has one, which it usually does).


# 6303c69c 09-Oct-1996 Garrett Wollman <wollman@FreeBSD.org>

Get rid of obsolete RTF_MASK and RTF_CHAINDELETE flags.


# 8a15e7f4 26-Aug-1996 Julian Elischer <julian@FreeBSD.org>

change a comment to match what the BSD4.4 book says.


# 9f9b3dc4 06-May-1996 Garrett Wollman <wollman@FreeBSD.org>

Add three new route flags to help determine what sort of address
the destination represents. For IP:

- Iff it is a host route, RTF_LOCAL and RTF_BROADCAST indicate local
(belongs to this host) and broadcast addresses, respectively.

- For all routes, RTF_MULTICAST is set if the destination is multicast.

The RTF_BROADCAST flag is used by ip_output() to eliminate a call to
in_broadcast() in a common case; this gives about 1% in our packet-generation
experiments. All three flags might be used (although they aren't now)
to determine whether a packet can be forwarded; a given host route can
represent a forwardable address if:

(rt->rt_flags & (RTF_HOST | RTF_LOCAL | RTF_BROADCAST | RTF_MULTICAST))
== RTF_HOST

Obviously, one still has to do all the work if a host route is not present,
but this code allows one to cache the results of such a lookup if rtalloc1()
is called without masking RTF_PRCLONING.
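
The forwardability test quoted in this message can be sketched as a small predicate. The flag constants below are illustrative placeholders with an _SK suffix, not the definitions from route.h, and the function name is hypothetical.

```c
#include <stdbool.h>

/* Placeholder flag bits for illustration only. */
#define RTF_HOST_SK		0x0004
#define RTF_LOCAL_SK		0x0100
#define RTF_BROADCAST_SK	0x0200
#define RTF_MULTICAST_SK	0x0400

/*
 * True when the flags describe a host route that is neither local,
 * broadcast, nor multicast, i.e. the masked value equals RTF_HOST alone.
 */
static bool
rt_host_is_forwardable_sketch(unsigned long rt_flags)
{
	const unsigned long mask = RTF_HOST_SK | RTF_LOCAL_SK |
	    RTF_BROADCAST_SK | RTF_MULTICAST_SK;

	return ((rt_flags & mask) == RTF_HOST_SK);
}
```

Comparing the masked flags against RTF_HOST alone checks all four bits in one step: the host bit must be set and the other three must be clear.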


# 6c5e9bbd 30-Jan-1996 Mike Pritchard <mpp@FreeBSD.org>

Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.


# f708ef1b 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Another mega commit to staticize things.


# 52041295 16-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

All net.* sysctl converted now.


# cc6a66f2 26-Oct-1995 Julian Elischer <julian@FreeBSD.org>

Reviewed by: julian and jhay@mikom.csir.co.za
Submitted by: Mike Mitchell, supervisor@alb.asctmd.com

This is a bulk import of Mike's IPX/SPX protocol stacks and all the
related gunf that goes with it..
it is not guaranteed to work 100% correctly at this time
but as we had several people trying to work on it
I figured it would be better to get it checked in so
they could all get the same thing to work on..

Mike's been using it for a year or so
but on 2.0

more changes and stuff will be merged in from other developers now that this is in.

Mike Mitchell, Network Engineer
AMTECH Systems Corporation, Technology and Manufacturing
8600 Jefferson Street, Albuquerque, New Mexico 87113 (505) 856-8000
supervisor@alb.asctmd.com


# 3ba6426d 16-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

Change signature of rt->rt_output() so that it is compatible with
ifp->if_output() functions. This way, initial implementations of
rt_output functionality can just lazily use if_output until customized
versions are written.


# 9e52b982 03-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

Import of 4.4-Lite-2 sys/net to make merge and examination easier. Since we
are not on the vendor branch for any of these files, the conflicts shown make
no matter.

Obtained from: 4.4BSD-Lite-2


# 28f8db14 29-Jul-1995 Bruce Evans <bde@FreeBSD.org>

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuration.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# c2bed6a3 20-Mar-1995 Garrett Wollman <wollman@FreeBSD.org>

Better fix for the deletion of parents of cloned routes problem,
superseding the `nextchild' hack. This also provides a way
forward to fix RTM_CHANGE and RTM_ADD as well.


# 81e3e057 08-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Define RTF_PINNED for future use.


# 54044407 07-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Correct fix for merge conflicts: RTM_VERSION is always 5. Header files
included by user code must never depend on kernel compile options.


# e128aa71 06-Feb-1995 David Greenman <dg@FreeBSD.org>

Fixed unresolved CVS conflict on RTM_VERSION.


# 24e16f2e 06-Feb-1995 Garrett Wollman <wollman@FreeBSD.org>

Merge in the socket-level support for Transaction TCP from the OLAH_TTCP
branch.

Submitted by: Andras Olah <olah@cs.utwente.nl>


# 6670942f 23-Jan-1995 Bruce Evans <bde@FreeBSD.org>

Declare `struct mbuf' with the correct scope to avoid lots of warnings
for compiling routed... Previously a kernel function pointer that is
bogusly visible to applications was incompletely declared to hide the
problem.


# 18e1f1f1 22-Jan-1995 Garrett Wollman <wollman@FreeBSD.org>

route.c: keep track of where cloned routes come from, and make sure to
delete them when the ``parent'' goes away

route.h: add glue to track this to rtentry structure. WARNING WILL ROBINSON!
This will be yet another incompatible change in your route-using binaries.
I apologize, but this was the only way to do it. I took this opportunity
to increase the size of the metrics to what I believe will be the final
length for 2.1, so that when the T/TCP stuff is done, this won't happen
again.


# 995add1a 13-Dec-1994 Garrett Wollman <wollman@FreeBSD.org>

Add support for two separate cloning flags, one set by the lower layers,
and one set by the protocol family. Also add another parameter to
rtalloc1() to allow for any interface flags to be ignored; currently
this is only useful for RTF_PRCLONING. Get rid of rt_prflags and re-unite
with rt_flags. Add T/TCP ``route metrics''.

NB: YOU MUST RECOMPILE `route' AND OTHER RELATED PROGRAMS AS A RESULT OF
THIS CHANGE.

This also adds a new interface parameter, `ifi_physical', which will
eventually replace IFF_ALTPHYS as the mechanism for specifying the
particular physical connection desired on a multiple-connection card.

NB: YOU MUST RECOMPILE `ifconfig' AND OTHER RELATED PROGRAMS AS A RESULT OF
THIS CHANGE.


# f084e014 02-Nov-1994 Garrett Wollman <wollman@FreeBSD.org>

Collapse two fields so that we have space for another 32 flags.
NB: You will have to recompile programs which use the `rt_use' member in
order to get the correct values. This should not cause incorrect operation,
but the statistics may look a little confusing.


# cea1da3b 20-Aug-1994 Paul Richards <paul@FreeBSD.org>

Make idempotent.

Submitted by: Paul


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources