History log of /freebsd-current/sys/net/rtsock.c
Revision Date Author Comments
# ce69e373 03-Feb-2024 Gleb Smirnoff <glebius@FreeBSD.org>

Revert "sockets: retire sorflush()"

Provide a comment in sorflush() why the socket I/O sx(9) lock is actually
important.

This reverts commit 507f87a799cf0811ce30f0ae7f10ba19b2fd3db3.


# ab6d773d 22-Jan-2024 Gordon Bergling <gbe@FreeBSD.org>

rtsock: Fix a typo in a source code comment

- s/adddress/address/

MFC after: 3 days


# 507f87a7 16-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: retire sorflush()

With removal of dom_dispose method the function boils down to two
meaningful function calls: socantrcvmore() and sbrelease(). The latter is
only relevant for protocols that use generic socket buffers.

The socket I/O sx(9) lock acquisition in sorflush() is not relevant for
shutdown(2) operation as it doesn't do any I/O that may interleave with
read(2) or write(2). The socket buffer mutex acquisition inside
sbrelease() is what guarantees thread safety. This sx(9) acquisition in
soshutdown() can be tracked down to 4.4BSD times, where it used to be
sblock(), and it was carried over through the years evolving together with
sockets with no reconsideration of why do we carry it over. I can't tell
if that sblock() made sense back then, but it doesn't make any today.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43415


# 5bba2728 16-Jan-2024 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: make pr_shutdown fully protocol specific method

Disassemble a one-for-all soshutdown() into protocol specific methods.
This creates a small amount of copy & paste, but makes code a lot more
self documented, as protocol specific method would execute only the code
that is relevant to that protocol and nothing else. This also fixes a
couple recent regressions and reduces risk of future regressions. The
extended KPI for the new pr_shutdown removes need for the extra pr_flush
which was added for the sake of SCTP which could not perform its shutdown
properly with the old one. Particularly for SCTP this change streamlines
a lot of code.

Some notes on why certain parts of code were copied or were not to certain
protocols:
* The (SS_ISCONNECTED | SS_ISCONNECTING | SS_ISDISCONNECTING) check is
needed only for those protocols that may be connected or disconnected.
* The above reduces into only SS_ISCONNECTED for those protocols that
always connect instantly.
* The ENOTCONN and continue processing hack is left only for datagram
protocols.
* The SOLISTENING(so) block is copied to those protocols that listen(2).
* sorflush() on SHUT_RD is copied almost to every protocol, but that
will be refactored later.
* wakeup(&so->so_timeo) is copied to protocols that can make a non-instant
connect(2), can SO_LINGER or can accept(2).

There are three protocols (netgraph(4), Bluetooth, SDP) that did not have
pr_shutdown, but old soshutdown() would still perform sorflush() on
SHUT_RD for them and also wakeup(9). Those protocols partially supported
shutdown(2) returning EOPNOTSUP for SHUT_WR/SHUT_RDWR, now they fully lost
shutdown(2) support. I'm pretty sure netgraph(4) and Bluetooth are okay
about that and SDP is almost abandoned anyway.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43413


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 21a722d9 25-Sep-2023 Zhenlei Huang <zlei@FreeBSD.org>

rtsock: Add sysctl flag CTLFLAG_TUN to loader tunable

The sysctl variable `net.route.netisr_maxqlen` is actually a loader
tunable. Add sysctl flag CTLFLAG_TUN to it so that `sysctl -T` will
report it correctly.

No functional change intended.

Reviewed by: glebius
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D41928


# 2ff63af9 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .h pattern

Remove /^\s*\*+\s*\$FreeBSD\$.*$\n/


# 2cda6a2f 26-Mar-2023 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: add public rt_is_exportable() version to check if
the route can be exported to userland when jailed.

Differential Revision: https://reviews.freebsd.org/D39204
MFC after: 2 weeks


# 2c2b37ad 13-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

ifnet/API: Move struct ifnet definition to a <net/if_private.h>

Hide the ifnet structure definition, no user serviceable parts inside,
it's a netstack implementation detail. Include it temporarily in
<net/if_var.h> until all drivers are updated to use the accessors
exclusively.

Reviewed by: glebius
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38046


# 42904794 15-Jan-2023 Alexander V. Chernikov <melifaro@FreeBSD.org>

rtsock: fix socket closure.

Currently `close(2)` erroneously return `EOPNOTSUPP` for `PF_ROUTE` sockets.
It happened after making rtsock socket implementation self-contained (
36b10ac2cd18 ). Rtsock code marks socket as connected in `rts_attach()`.
`soclose()` tries to disconnect such socket using `.pr_disconnect` callback.
Rtsock does not implement this callback, resulting in the default method being
substituted. This default method returns `ENOTSUPP`, failing `soclose()` logic.

This diff restores the previous behaviour by adding custom `pr_disconnect()`
returning `ENOTCONN`.

Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D38059


# 1bcd230f 03-Dec-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

netlink: add interface notification on link status / flags change.

* Add link-state change notifications by subscribing to ifnet_link_event.
In the Linux netlink model, link state is reported in 2 places: first is
the IFLA_OPERSTATE, which stores state per RFC2863.
The second is an IFF_LOWER_UP interface flag. As many applications rely
on the latter, reserve 1 bit from if_flags, named as IFF_NETLINK_1.
This flag is mapped to IFF_LOWER_UP in the netlink headers. This is done
to avoid making applications think this flag is actually
supported / presented in non-netlink outputs.
* Add flag change notifications, by hooking into rt_ifmsg().
In the netlink model, notification should include the bitmask for the
change flags. Update rt_ifmsg() to include such bitmask.

Differential Revision: https://reviews.freebsd.org/D37597


# 7e5bf684 20-Jan-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

netlink: add netlink support

Netlinks is a communication protocol currently used in Linux kernel to modify,
read and subscribe for nearly all networking state. Interfaces, addresses, routes,
firewall, fibs, vnets, etc are controlled via netlink.
It is async, TLV-based protocol, providing 1-1 and 1-many communications.

The current implementation supports the subset of NETLINK_ROUTE
family. To be more specific, the following is supported:
* Dumps:
- routes
- nexthops / nexthop groups
- interfaces
- interface addresses
- neighbors (arp/ndp)
* Notifications:
- interface arrival/departure
- interface address arrival/departure
- route addition/deletion
* Modifications:
- adding/deleting routes
- adding/deleting nexthops/nexthops groups
- adding/deleting neghbors
- adding/deleting interfaces (basic support only)
* Rtsock interaction
- route events are bridged both ways

The implementation also supports the NETLINK_GENERIC family framework.

Implementation notes:
Netlink is implemented via loadable/unloadable kernel module,
not touching many kernel parts.
Each netlink socket uses dedicated taskqueue to support async operations
that can sleep, such as interface creation. All message processing is
performed within these taskqueues.

Compatibility:
Most of the Netlink data models specified above maps to FreeBSD concepts
nicely. Unmodified ip(8) binary correctly works with
interfaces, addresses, routes, nexthops and nexthop groups. Some
software such as net/bird require header-only modifications to compile
and work with FreeBSD netlink.

Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D36002
MFC after: 2 months


# 177f04d5 29-Aug-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: constantify @rc in rib_decompose_notification().

Clarify the @rc immutability by explicitly marking @rc const.

MFC after: 2 weeks


# e7d02be1 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: refactor protosw and domain static declaration and load

o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d18ef), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860dcd).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee54ea).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232


# 036f1bc6 14-Aug-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: retire rib_lookup_info()

This function was added in pre-epoch era ( 9a1b64d5a0224 ) to
provide public rtentry access interface & hide rtentry internals.
The implementation is based on the large on-stack copying and
refcounting of the referenced objects (ifa/ifp).
It has become obsolete after epoch & nexthop introduction. Convert
the last remaining user and remove the function itself.

Differential Revision: https://reviews.freebsd.org/D36197


# f73e4f6c 11-Aug-2022 Mateusz Guzik <mjg@FreeBSD.org>

routing: unbreak the build of a bunch of kernels

Sponsored by: Rubicon Communications, LLC ("Netgate")


# d8b42ddc 11-Aug-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

rtsock: subscribe to ifnet eventhandlers instead of direct calls.

Stop treating rtsock as a "special" consumer and use already-provided
ifaddr arrival/departure notifications.

MFC after: 2 weeks

Test Plan:
```
21:05 [0] m@devel0 route -n monitor

-> ifconfig vtnet0.2 create

got message of size 24 on Tue Aug 9 21:05:44 2022
RTM_IFANNOUNCE: interface arrival/departure: len 24, if# 3, what: arrival

got message of size 168 on Tue Aug 9 21:05:54 2022
RTM_IFINFO: iface status change: len 168, if# 3, link: up, flags:<BROADCAST,RUNNING,SIMPLEX,MULTICAST>

-> ifconfig vtnet0.2 destroy

got message of size 24 on Tue Aug 9 21:05:54 2022
RTM_IFANNOUNCE: interface arrival/departure: len 24, if# 3, what: departure

```

Reviewed By: glebius
Differential Revision: https://reviews.freebsd.org/D36095
MFC after: 2 weeks


# 36b10ac2 11-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

rtsock: do not use raw socket code

This makes routing socket implementation self contained and removes one
of the last dependencies on the raw socket code and pr_output method.

There are very subtle API visible changes:
- now routing socket would return EOPNOTSUPP instead of EINVAL on
syscalls that are not supposed to be called on a routing socket.
- routing socket buffer sizes are now controlled by net.rtsock
sysctls instead of net.raw. The latter were not documented
anywhere, and even Internet search doesn't find any references
or discussions related to these sysctls.

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36122


# d94ec749 11-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

rtsock: do not allocate mbufs_tags(9) just to store a 8-bit value

Use local storage of the mbuf packet header instead.

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36121


# ae6bfd12 01-Aug-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: refactor private KPI
* Make nhgrp_get_nhops() return const struct weightened_nhop to
indicate that the list is immutable
* Make nhgrp_get_group() return the actual group, instead of
group+weight.

MFC after: 2 weeks


# 2717e958 27-Jul-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: move route expiration time to its nexthop

Expiration time is actually a path property, not a route property.
Move its storage to nexthop to simplify upcoming nhop(9) KPI changes
and netlink introduction.

Differential Revision: https://reviews.freebsd.org/D35970
MFC after: 2 weeks


# 33a0803f 26-Jun-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: fix debug headers added in 6fa8ed43ee0c #2.

Move debug declaration out of COMPAT_FREEBSD32 in rtsock.c

MFC after: 2 weeks


# 0e87bab6 25-Jun-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: fix debug headers added in 6fa8ed43ee0c.

- move debug headers out of COMPAT_FREEBSD32 in rtsock.c
- remove accidentally-added LOG_ defines from syslog.h

MFC after: 2 weeks


# 76179e40 25-Jun-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: fix syslog include for rtsock.c

MFC after: 2 weeks


# 6fa8ed43 25-Jun-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: improve debugging.

Use unified guidelines for the severity across the routing subsystem.
Update severity for some of the already-used messages to adhere the
guidelines.
Convert rtsock logging to the new FIB_ reporting format.

MFC after: 2 weeks


# c260d5cd 25-Jun-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

routing: fix crash when RTM_CHANGE results in no-op for the multipath
route.

Reporting logic assumed there is always some nhop change for every
successful modification operation. Explicitly check that the changed
nexthop indeed exists when reporting back to userland.

MFC after: 2 weeks
Reported by: Claudio Jeker <claudio.jeker@klarasystems.com>
Tested by: Claudio Jeker <claudio.jeker@klarasystems.com>


# 9573cc35 13-May-2022 Kurosawa Takahiro <takahiro.kurosawa@gmail.com>

rtsock: fix a stack overflow

struct sockaddr is not sufficient for buffer that can hold any
sockaddr_* structure. struct sockaddr_storage should be used.

Test:
ifconfig epair create
ifconfig epair0a inet6 add 2001:db8::1 up
ndp -s 2001:db8::2 02:86:98:2e:96:0b proxy # this triggers kernel stack overflow

Reviewed by: markj, kp
Differential Revision: https://reviews.freebsd.org/D35188


# e606e5d1 04-Apr-2022 Warner Losh <imp@FreeBSD.org>

sysctl_dumpentry: move error to inner scope

Sponsored by: Netflix


# 6abb5043 27-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

rtsock: always set m_pkthdr.rcvif when queueing on netisr

netisr uses global workstreams and after dequeueing an mbuf it
uses rcvif to get the VNET of the mbuf. Of course, this is not
needed when kernel is compiled without VIMAGE. It came out that
routing socket does not set rcvif if compiled without VIMAGE.
Make this assignment not depending on VIMAGE option.

Fixes: 6871de9363e5


# 644ca084 03-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

domains: make domain_init() initialize only global state

Now that each module handles its global and VNET initialization
itself, there is no VNET related stuff left to do in domain_init().

Differential revision: https://reviews.freebsd.org/D33541


# 89128ff3 03-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protocols: init with standard SYSINIT(9) or VNET_SYSINIT

The historical BSD network stack loop that rolls over domains and
over protocols has no advantages over more modern SYSINIT(9).
While doing the sweep, split global and per-VNET initializers.

Getting rid of pr_init allows to achieve several things:
o Get rid of ifdef's that protect against double foo_init() when
both INET and INET6 are compiled in.
o Isolate initializers statically to the module they init.
o Makes code easier to understand and maintain.

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D33537


# 79554f2b 14-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

net: whack "set but not used" warnings in net/rtsock.c

... except for one where the error is ignored.

Sponsored by: Rubicon Communications, LLC ("Netgate")


# 0dcef81d 23-Jul-2021 Mark Johnston <markj@FreeBSD.org>

Add required sysctl name length checks to various handlers

Reported by: KMSAN
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# a3c2c06b 06-Jun-2021 Bjoern A. Zeeb <bz@FreeBSD.org>

Make LINT NOINET and NOIP kernel builds warning free.

Apply #ifdef INET or #if defined(INET6) || defined(INET) to make
universe NOINET and NOIP LINT kernels warning free as well again.


# 76cfc6fa 14-May-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix a use after free in update_rtm_from_rc().

update_rtm_from_rc() calls update_rtm_from_info() internally.
The latter one may update provided prtm pointer with a new rtm.
Reassign rtm from prtm afeter calling update_rtm_from_info() to
avoid touching the freed rtm.

PR: 255871
Submitted by: lylgood@foxmail.com
MFC after: 3 days


# 25682e6a 27-Apr-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix rtsock sockaddr alignment.

b31fbebeb3 introduced alloc_sockaddr_aligned() which, in fact,
failed to produce aligned addresses.

Reported by: Oskar Holmlund <oskar.holmlund at yahoo.com>
MFC after: immediately


# 5d1403a7 23-Apr-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

[rtsock] Enforce netmask/RTF_HOST consistency.

Traditionally we had 2 sources of information whether the
added/delete route request targets network or a host route:
netmask (RTA_NETMASK) and RTF_HOST flag.

The former one is tricky: netmask can be empty or can explicitly
specify the host netmask. Parsing netmask sockaddr requires per-family
parsing and that's what rtsock code traditionally avoided. As a result,
consistency was not enforced and it was possible to specify network with
the RTF_HOST flag and vice versa.

Continue normalization efforts from D29826 and D29826 and ensure that
RTF_HOST flag always reflects host/network data from netmask field.

Differential Revision: https://reviews.freebsd.org/D29958
MFC after: 2 days


# b31fbebe 19-Apr-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Relax rtsock message restrictions.

Address multiple issues with strict rtsock message validation.

D28668 "normalisation" approach was based on the assumption that
we always have at least "standard" sockaddr len.
It turned out to be false - certain older applications like quagga
or routed abuse sin[6]_len field and set it to the offset to the
first fully-zero bit in the mask. It is impossible to normalise
such sockaddrs without reallocation.

With that in mind, change the approach to use a distinct memory
buffer for the altered sockaddrs. This allows supporting the older
software while maintaining the guarantee on the "standard" sockaddrs.

PR: 255273,255089
Differential Revision: https://reviews.freebsd.org/D29826
MFC after: 3 days


# 758c9d54 19-Apr-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Improve error reporting in rtsock.c

MFC after: 3 days


# 7f5f3fcc 10-Apr-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix direct route installation with net/bird.

Slighly relax the gateway validation rules imposed by the
2fe5a79425c7, by requiring only first 8 bytes (everyhing
before sdl_data to be present in the AF_LINK gateway.

Reported by: olivier


# e5b394f2 20-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix setting static entries for arp/ndp.

rtsock message validation changes committed in 2fe5a79425c7
did not take llinfo messages into account.

Add a special validation case for RTA_GATEWAY llinfo messages.

MFC after: 2 days


# f9e1cd6c 19-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix arp/ndp deletion broken by 2fe5a79425c7.

Changes in the 2fe5a79425c7 moved dst sockaddr masking from the
routing control plane to the rtsock code.

It broke arp/ndp deletion.
It turns out, arp/ndp perform RTM_GET request first to get an
interface index necessary for the deletion.
Then they simply stamp the reply with RTF_LLDATA and set the
command to RTM_DELETE.
As a result, kernel receives request with non-empty RTA_NETMASK
and clears RTA_DST host bits before passing the message to the
lla code.

De facto, the only needed bits are RTA_DST, RTA_GATEWAY and the
subset of rtm_flags.

With that in mind, fix the interace by clearing RTA_NETMASK
for every messages with RTF_LLDATA.

While here, cleanup arp/ndp code a bit.

MFC after: 1 day
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D28804


# a4513bac 16-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix NOINET6 build broken by 2fe5a79425c7.

Reported by: mjg


# 2fe5a794 16-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix dst/netmask handling in routing socket code.

Traditionally routing socket code did almost zero checks on
the input message except for the most basic size checks.

This resulted in the unclear KPI boundary for the routing system code
(`rtrequest*` and now `rib_action()`) w.r.t message validness.

Multiple potential problems and nuances exists:
* Host bits in RTAX_DST sockaddr. Existing applications do send prefixes
with hostbits uncleared. Even `route(8)` does this, as they hope the kernel
would do the job of fixing it. Code inside `rib_action()` needs to handle
it on its own (see `rt_maskedcopy()` ugly hack).
* There are multiple way of adding the host route: it can be DST without
netmask or DST with /32(/128) netmask. Also, RTF_HOST has to be set correspondingly.
Currently, these 2 options create 2 DIFFERENT routes in the kernel.
* no sockaddr length/content checking for the "secondary" fields exists: nothing
stops rtsock application to send sockaddr_in with length of 25 (instead of 16).
Kernel will accept it, install to RIB as is and propagate to all rtsock consumers,
potentially triggering bugs in their code. Same goes for sin_port, sin_zero, etc.

The goal of this change is to make rtsock verify all sockaddr and prefix consistency.
Said differently, `rib_action()` or internals should NOT require to change any of the
sockaddrs supplied by `rt_addrinfo` structure due to incorrectness.

To be more specific, this change implements the following:
* sockaddr cleanup/validation check is added immediately after getting sockaddrs from rtm.
* Per-family dst/netmask checks clears host bits in dst and zeros all dst/netmask "secondary" fields.
* The same netmask checking code converts /32(/128) netmasks to "host" route case
(NULL netmask, RTF_HOST), removing the dualism.
* Instead of allowing ANY "known" sockaddr families (0<..<AF_MAX), allow only actually
supported ones (inet, inet6, link).
* Automatically convert `sockaddr_sdl` (AF_LINK) gateways to
`sockaddr_sdl_short`.

Reported by: Guy Yur <guyyur at gmail.com>
Reviewed By: donner
Differential Revision: https://reviews.freebsd.org/D28668
MFC after: 3 days


# 64d5c277 14-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove now-unused RTF_RNH_LOCKED route flag.

MFC after: 1 week


# 8ca99aec 12-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix various NOINET* builds broken by 145bf6c0af48.

Reported by: mjg, bdragon


# 145bf6c0 08-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix blackhole/reject routes.

Traditionally *BSD routing stack required to supply some
interface data for blackhole/reject routes. This lead to
varieties of hacks in routing daemons when inserting such routes.
With the recent routeing stack changes, gateway sockaddr without
RTF_GATEWAY started to be treated differently, purely as link
identifier.

This change broke net/bird, which installs blackhole routes with
127.0.0.1 gateway without RTF_GATEWAY flags.

Fix this by automatically constructing necessary gateway data at
rtsock level if RTF_REJECT/RTF_BLACKHOLE is set.

Reported by: Marek Zarychta <zarychtam at plan-b.pwste.edu.pl>
Reviewed by: donner
MFC after: 1 week


# d68cf57b 07-Jan-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Refactor rt_addrmsg() and rt_routemsg().

Summary:
* Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally.
* Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer.
* Refactor rtsock_routemsg(): avoid accessing rtentry fields directly.
* Simplify in_addprefix() by moving prefix search to a separate function.

Reviewers: #network

Subscribers: imp, ae, bz

Differential Revision: https://reviews.freebsd.org/D28011


# 2fb4a03d 24-Dec-2020 Ryan Libby <rlibby@FreeBSD.org>

rtsock: quiet -Wunused-variable in LINT-NOIP kernels

Fixup after r368769 / d68fb8d978bbd3cf1b5db35f309aaeab30636d43.

Reported by: mjg
Reviewed by: melifaro
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D27730


# 92be2847 23-Dec-2020 Mark Johnston <markj@FreeBSD.org>

rtsock: Avoid copying uninitialized padding bytes

When copying sockaddrs out to userspace, we pad them to a multiple of
the platform alignment (sizeof(long)). However, some sockaddr sizes,
such as struct sockaddr_dl, are not an integer multiple of the
alignment, so we may end up copying out uninitialized bytes.

Fix this by always bouncing through a pre-zeroed sockaddr_storage.

Reported by: KASAN
Reviewed by: melifaro
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27729


# d68fb8d9 18-Dec-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Switch direct rt fields access in rtsock.c to newly-create field acessors.

rtsock code was build around the assumption that each rtentry record
in the system radix tree is a ready-to-use sockaddr. This assumptions
turned out to be not quite true:
* masks have their length tweaked, so we have rtsock_fix_netmask() hack
* IPv6 addresses have their scope embedded, so we have another explicit
deembedding hack.

Change the code to decouple rtentry internals from rtsock code using
newly-created rtentry accessors. This will allow to eventually eliminate
both of the hacks and change rtentry dst/mask format.

Differential Revision: https://reviews.freebsd.org/D27451


# d1d941c5 29-Nov-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove RADIX_MPATH config option.

ROUTE_MPATH is the new config option controlling new multipath routing
implementation. Remove the last pieces of RADIX_MPATH-related code and
the config option.

Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D27244


# 1b95005e 04-Oct-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix route flags update during RTM_CHANGE.
Nexthop lookup was not consireding rt_flags when doing
structure comparison, which lead to an original nexthop
selection when changing flags. Fix the case by adding
rt_flags field into comparison and rearranging nhop_priv
fields to allow for efficient matching.
Fix `route change X/Y flags` case - recent changes
disallowed specifying RTF_GATEWAY flag without actual gateway.
It turns out, route(8) fills in RTF_GATEWAY by default, unless
-interface flag is specified. Fix regression by clearing
RTF_GATEWAY flag instead of failing.
Fix route flag reporting in RTM_CHANGE messages by explicitly
updating rtm_flags after operation competion.
Add IPv4/IPv6 tests for flag-only route changes.


# 9c584fa4 03-Oct-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove ROUTE_MPATH-related warnings introduced in r366390.

Reported by: mjg


# fedeb08b 03-Oct-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Introduce scalable route multipath.

This change is based on the nexthop objects landed in D24232.

The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
relative weights and a dataplane-optimized structure to enable
efficient nexthop selection.

Simular to the nexthops, nexthop groups are immutable. Dataplane part
gets compiled during group creation and is basically an array of
nexthop pointers, compiled w.r.t their weights.

With this change, `rt_nhop` field of `struct rtentry` contains either
nexthop or nexthop group. They are distinguished by the presense of
NHF_MULTIPATH flag.
All dataplane lookup functions returns pointer to the nexthop object,
leaving nexhop groups details inside routing subsystem.

User-visible changes:

The change is intended to be backward-compatible: all non-mpath operations
should work as before with ROUTE_MPATH and net.route.multipath=1.

All routes now comes with weight, default weight is 1, maximum is 2^24-1.

Current maximum multipath group width is statically set to 64.
This will become sysctl-tunable in the followup changes.

Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1

route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20

netstat -6On

Nexthop groups data

Internet6:
GrpIdx NhIdx Weight Slots Gateway Netif Refcnt
1 ------- ------- ------- --------------------------------------- --------- 1
13 10 1 2001:db8::2 vlan2
14 20 2 2001:db8::3 vlan2

Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default

Tested by: olivier
Reviewed by: glebius
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D26449


# 2259a030 21-Sep-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Rework part of routing code to reduce difference to D26449.

* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(),
as these two can be applied at different entities and at different times.
* Start filling route weight in route change notifications
* Pass flowid to UDP/raw IP route lookups
* Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact
that rtentry can contain multiple nexthops.

Differential Revision: https://reviews.freebsd.org/D26497


# 4fa9815a 05-Sep-2020 Ed Maste <emaste@FreeBSD.org>

rtsock.c: remove extraneous space

Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D26249


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# a624ca3d 28-Aug-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Move net/route/shared.h definitions to net/route/route_var.h.

No functional changes.

net/route/shared.h was created in the inital phases of nexthop conversion.
It was intended to serve the same purpose as route_var.h - share definitions
of functions and structures between the routing subsystem components. At
that time route_var.h was included by many files external to the routing
subsystem, which largerly defeats its purpose.

As currently this is not the case anymore and amount of route_var.h includes
is roughly the same as shared.h, retire the latter in favour of the former.


# 592d300e 24-Aug-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove RT_LOCK mutex from rte.

rtentry lock traditionally served 2 purposed: first was protecting refcounts,
the second was assuring consistent field access/changes.
Since route nexthop introduction, the need for the former disappeared and
the need for the latter reduced.
To be more precise, the following rte field are mutable:

rt_nhop (nexthop pointer, updated with RIB_WLOCK, passed in rib_cmd_info)
rte_flags (only RTF_HOST and RTF_UP, where RTF_UP gets changed at rte removal)
rt_weight (relative weight, updated with RIB_WLOCK, passed in rib_cmd_info)
rt_expire (time when rte deletion is scheduled, updated with RIB_WLOCK)
rt_chain (deletion chain pointer, updated with RIB_WLOCK)
All of them are updated under RIB_WLOCK, so the only remaining concern is the reading.

rt_nhop and rt_weight (addressed in this review) are read under rib lock and
stored in the rib_cmd_info, so the caller has no problem with consitency.
rte_flags is currently read unlocked in rtsock reporting (however the scope
is only RTF_UP flag, which is pretty static).
rt_expire is currently read unlocked in rtsock reporting.
rt_chain accesses are safe, as this is only used at route deletion.

rt_expire and rte_flags reads will be dealt in a separate reviews soon.

Differential Revision: https://reviews.freebsd.org/D26162


# 93bfd365 22-Aug-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Rename rt_flags to rte_flags && reduce number of rt_nhop accesses.

No functional changes.

Most of the routing flags are stored in the netxtop instead of rtentry.
Rename rt->rt_flags to rt->rte_flags to simplify reading/modifying code
checking routing flags.

In the new multipath code, rt->rt_nhop may actually point to nexthop group
instead of nhop. To ease transition, reduce the amount of rt->rt_nhop->...
accesses.

Differential Revision: https://reviews.freebsd.org/D26156


# bec053ff 15-Aug-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Make net.inet6.ip6.deembed_scopeid behaviour default & remove sysctl.

Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25637


# a287a973 10-Jun-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Switch rtsock code to using newly-create rib_action() KPI call.

This simplifies the code and allows to further split rtentry and nexthop,
removing one of the blockers for multipath code introduction, described in
D24141.

Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D25192


# 2bbab0af 23-May-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Use epoch(9) for rtentries to simplify control plane operations.

Currently the only reason of refcounting rtentries is the need to report
the rtable operation details immediately after the execution.
Delaying rtentry reclamation allows to stop refcounting and simplify the code.
Additionally, this change allows to reimplement rib_lookup_info(), which
is used by some of the customers to get the matching prefix along
with nexthops, in more efficient way.

The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to
nhop_priv to be able to reliably set curvnet even during vnet teardown.
Rest of the reference counting code will be removed in the D24867 .

Differential Revision: https://reviews.freebsd.org/D24866


# 656442a7 08-May-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Embed dst sockaddr into rtentry and remove rte packet counter

Currently each rtentry has dst&gateway allocated separately from another zone,
bloating cache accesses.

Current 'struct rtentry' has 12 "mandatory" radix pointers in the beginning,
leaving 4 usable pointers/32 bytes in the first 2 cache lines (amd64).
Fields needed for the datapath are destination sockaddr and rt_nhop.

So far it doesn't look like there is other routable addressing protocol other
than IPv4/IPv6/MPLS, which uses keys longer than 20 bytes.
With that in mind, embed dst into struct rtentry, making the first 24 bytes
of rtentry within 128 bytes. That is enough to make IPv6 address within first
128 bytes.

It is still pretty easy to add code for supporting separately-allocated dst,
however it doesn't make a lot of sense in having such code without a use case.

As rS359823 moved the gateway to the nexthop structure, the dst embedding change
removes the need for any additional allocations done by rt_setgate().

Lastly, as a part of cleanup, remove counter(9) allocation code, as this field
is not used in packet processing anymore.

Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24669


# 8c61eb21 29-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert more rtentry field accesses into nhop fields accesses.

Continue routing subsystem conversion to nhop objects defined in r359823.
Use fields from nhop structure instead of "struct rtentry" fields.
This is one of the last changes prior to removing rt_ifp, rt_ifa,
rt_gateway and rt_mtu from struct rtentry.

Differential Revision: https://reviews.freebsd.org/D24609


# fc88ecd3 28-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Move route-specific ddb commands to route/route_ddb.c

Currently functionality resides in rtsock.c, which is a controlling
interface, partially external to the routing subsystem.
Additionally, DDB-supporting functionality is > 100SLOC, which deserves
a separate file.

Given that, move this functionality to a newly-created net/route/ subdir.

Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D24561


# e7d8af4f 28-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Move route_temporal.c and route_var.h to net/route.

Nexthop objects implementation, defined in r359823,
introduced sys/net/route directory intended to hold all
routing-related code. Move recently-introduced route_temporal.c and
private route_var.h header there.

Differential Revision: https://reviews.freebsd.org/D24597


# aaad3c4f 23-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert rtentry field accesses into nhop field accesses.

One of the goals of the new routing KPI defined in r359823 is to entirely
hide`struct rtentry` from the consumers. It will allow to improve routing
subsystem internals and deliver more features much faster.

This commit is mostly mechanical change to eliminate direct struct rtentry
field accesses.

The only notable difference is AF_LINK gateway encoding.

AF_LINK gw is used in routing stack for operations with interface routes
and host loopback routes.
In the former case it indicates _some_ non-NULL gateway, as the interface
is the same as in rt_ifp in kernel and rtm_ifindex in rtsock reporting.
In the latter case the interface index inside gateway was used by the IPv6
datapath to verify address scope for link-local interfaces.

Kernel uses struct sockaddr_dl for this type of gateway. This structure
allows for specifying rich interface data, such as mac address and interface
name. However, this results in relatively large structure size - 52 bytes.
Routing stack fils in only 2 fields - sdl_index and sdl_type, which reside
in the first 8 bytes of the structure.

In the new KPI, struct nhop_object tries to be cache-efficient, hence
embodies gateway address inside the structure. In the AF_LINK case it
stores stortened version of the structure - struct sockaddr_dl_short,
which occupies 16 bytes. After D24340 changes, the data inside AF_LINK
gateway will not be used in the kernel at all, leaving rtsock as the only
potential concern.

The difference in rtsock reporting:

(old)
got message of size 240 on Thu Apr 16 03:12:13 2020
RTM_ADD: Add Route: len 240, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 link#5 255.255.255.0

(new)
got message of size 200 on Sun Apr 19 09:46:32 2020
RTM_ADD: Add Route: len 200, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 link#5 255.255.255.0

Note 40 bytes different (52-16 + alignment).
However, gateway is still a valid AF_LINK gateway with proper data filled in.

It is worth noting that these particular messages (interface routes) are mostly
ignored by routing daemons:
* bird/quagga/frr uses RTM_NEWADDR and ignores prefix route addition messages.
* quagga/frr ignores routes without gateway

More detailed overview on how rtsock messages are used by the
routing daemons to reconstruct the kernel view, can be found in D22974.

Differential Revision: https://reviews.freebsd.org/D24519


# 2538a4b1 16-Apr-2020 Adrian Chadd <adrian@FreeBSD.org>

Remove an duplicate definition of nhops_dump_sysctl()

One of the source files included both nhop.h and shared.h, leading to this
clashing.

Tested with: mips-gcc cross toolchain


# a6663252 12-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Introduce nexthop objects and new routing KPI.

This is the foundational change for the routing subsytem rearchitecture.
More details and goals are available in https://reviews.freebsd.org/D24141 .

This patch introduces concept of nexthop objects and new nexthop-based
routing KPI.

Nexthops are objects, containing all necessary information for performing
the packet output decision. Output interface, mtu, flags, gw address goes
there. For most of the cases, these objects will serve the same role as
the struct rtentry is currently serving.
Typically there will be low tens of such objects for the router even with
multiple BGP full-views, as these objects will be shared between routing
entries. This allows to store more information in the nexthop.

New KPI:

struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst,
uint32_t scopeid, uint32_t flags, uint32_t flowid);
struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6,
uint32_t scopeid, uint32_t flags, uint32_t flowid);

These 2 function are intended to replace all all flavours of
<in_|in6_>rtalloc[1]<_ign><_fib>, mpath functions and the previous
fib[46]-generation functions.

Upon successful lookup, they return nexthop object which is guaranteed to
exist within current NET_EPOCH. If longer lifetime is desired, one can
specify NHR_REF as a flag and get a referenced version of the nexthop.
Reference semantic closely resembles rtentry one, allowing sed-style conversion.

Additionally, another 2 functions are introduced to support uRPF functionality
inside variety of our firewalls. Their primary goal is to hide the multipath
implementation details inside the routing subsystem, greatly simplifying
firewalls implementation:

int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid,
uint32_t flags, const struct ifnet *src_if);
int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid,
uint32_t flags, const struct ifnet *src_if);

All functions have a separate scopeid argument, paving way to eliminating IPv6 scope
embedding and allowing to support IPv4 link-locals in the future.

Structure changes:
* rtentry gets new 'rt_nhop' pointer, slightly growing the overall size.
* rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz.

Old KPI:
During the transition state old and new KPI will coexists. As there are another 4-5
decent-sized conversion patches, it will probably take a couple of weeks.
To support both KPIs, fields not required by the new KPI (most of rtentry) has to be
kept, resulting in the temporary size increase.
Once conversion is finished, rtentry will notably shrink.

More details:
* architectural overview: https://reviews.freebsd.org/D24141
* list of the next changes: https://reviews.freebsd.org/D24232

Reviewed by: ae,glebius(initial version)
Differential Revision: https://reviews.freebsd.org/D24232


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# e02d3fe7 07-Jan-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix rtsock route message generation for interface addresses.

Reviewed by: olivier
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D22974


# d9302031 31-Dec-2019 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix NOINET6 build broken by r356236.

MFC after: 2 weeks


# c83dda36 31-Dec-2019 Alexander V. Chernikov <melifaro@FreeBSD.org>

Split gigantic rtsock route_output() into smaller functions.

Amount of changes to the original code has been intentionally minimised
to ease diffing.
The changes are mostly mechanical, with the following exceptions:

* lltable handler is now called directly based of RTF_LLINFO flag presense.
* "report" logic for updating rtm in RTM_GET/RTM_DELETE has been simplified,
fixing several potential use-after-free cases in rt_addrinfo.
* llable asserts has been replaced with error-returning, preventing kernel
crashes when lltable gw af family is invalid (root required).

MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22864


# 8a9a28c4 07-Nov-2019 Gleb Smirnoff <glebius@FreeBSD.org>

sysctl_rtsock() has all necessary locking and doesn't need Giant to run.
While here add description.


# b8a6e03f 07-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Widen NET_EPOCH coverage.

When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111


# 4835262b 10-Sep-2019 Li-Wen Hsu <lwhsu@FreeBSD.org>

Fix build for the platforms where db_expr_t is not long

Sponsored by: The FreeBSD Foundation


# 0b12ab81 09-Sep-2019 Conrad Meyer <cem@FreeBSD.org>

Appease Clang false-positive Werrors in r352112

Reported by: bcran


# 8b6acd2b 09-Sep-2019 Conrad Meyer <cem@FreeBSD.org>

ddb(4): Add 'show route <dest>' and 'show routetable [<af>]'

These commands show the route resolved for a specified destination, or
print out the entire routing table for a given address family (or all
families, if none is explicitly provided).

Discussed with: emaste
Differential Revision: https://reviews.freebsd.org/D21510


# e6481fd4 14-Apr-2019 Michael Tuexen <tuexen@FreeBSD.org>

When sending a routing message, don't allow the user to set the
RTF_RNH_LOCKED flag in rtm_flags, since this flag is used only
internally.

Reported by: syzbot+65c676f5248a13753ea0@syzkaller.appspotmail.com
Reviewed by: ae@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19898


# cc31f982 10-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Remove recursive NET_EPOCH_ENTER() from sysctl_ifmalist(), missed in r342872.


# a68cc388 08-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanical cleanup of epoch(9) usage in network stack.

- Remove macros that covertly create epoch_tracker on thread stack. Such
macros a quite unsafe, e.g. will produce a buggy code if same macro is
used in embedded scopes. Explicitly declare epoch_tracker always.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.

This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.

Discussed with: jtl, gallatin


# a716ad4a 27-Nov-2018 Andrey V. Elsukov <ae@FreeBSD.org>

Fix possible panic during ifnet detach in rtsock.

The panic can happen, when some application does dump of routing table
using sysctl interface. To prevent this, set IFF_DYING flag in
if_detach_internal() function, when ifnet under lock is removed from
the chain. In sysctl_rtsock() take IFNET_RLOCK_NOSLEEP() to prevent
ifnet detach during routes enumeration. In case, if some interface was
detached in the time before we take the lock, add the check, that ifnet
is not DYING. This prevents access to memory that could be freed after
ifnet is unlinked.

PR: 227720, 230498, 233306
Reviewed by: bz, eugen
MFC after: 1 week
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D18338


# d25f8522 26-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Plug routing sysctl leaks.

Various structures exported by sysctl_rtsock() contain padding fields
which were not being zeroed.

Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: ae
MFC after: 3 days
Security: kernel memory disclosure
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18333


# 5f901c92 24-Jul-2018 Andrew Turner <andrew@FreeBSD.org>

Use the new VNET_DEFINE_STATIC macro when we are defining static VNET
variables.

Reviewed by: bz
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16147


# 6573d758 03-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

epoch(9): allow preemptible epochs to compose

- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066


# 20efcfc6 16-Jun-2018 Andrey V. Elsukov <ae@FreeBSD.org>

Switch RIB and RADIX_NODE_HEAD lock from rwlock(9) to rmlock(9).

Using of rwlock with multiqueue NICs for IP forwarding on high pps
produces high lock contention and inefficient. Rmlock fits better for
such workloads.

Reviewed by: melifaro, olivier
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D15789


# 4f6c66cc 23-May-2018 Matt Macy <mmacy@FreeBSD.org>

UDP: further performance improvements on tx

Cumulative throughput while running 64
netperf -H $DUT -t UDP_STREAM -- -m 1
on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps

Single stream throughput increases from 910kpps to 1.18Mpps

Baseline:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg

- Protect read access to global ifnet list with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg

- Protect short lived ifaddr references with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg

- Convert if_afdata read lock path to epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

A fix for the inpcbhash contention is pending sufficient time
on a canary at LLNW.

Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15409


# d7c5a620 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

ifnet: Replace if_addr_lock rwlock with epoch + mutex

Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree
4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32
4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32
4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32
4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32
4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32
4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32
4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32

After the patch

InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree
5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51
5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51
5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51
5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51
5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52
5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15366


# 7edd877a 06-May-2018 Matt Macy <mmacy@FreeBSD.org>

The ifnet pointer (ifp) in rt_newaddrmsg can be valid without ifp->if_addr being set if
if the ifnet is still live by way of a reference but
in line for deletion. Check ifp->if_addr before dereferencing.

Approved by: sbruno


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# b3b6ff23 21-Feb-2018 Ryan Stone <rstone@FreeBSD.org>

Allow route change requests to not specify the gateway.

Only require a gateway to be specified on a route add request. On
a route change request that does not specify the gateway, the
gateway will remain the same. This allows changing other route
parameters without having to re-specifying the gateway, like in
"route change 10.0.0.0/8 -mtu 9000".

Update the route(8) manpage to explicitly call out this usage
as being supported.

MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Reviewed By: eugen (rtsock.c change), rgrimes
Differential Revision: https://reviews.freebsd.org/D14291


# 279e33d4 22-Jan-2018 Konstantin Belousov <kib@FreeBSD.org>

Fix compat32 for sysctl net.PF_ROUTE...NET_RT_IFLISTL.

Route messages are aligned to the host long type alignment, which
breaks 32bit.

Reported and tested by: lwhsu
Diagnosed by: Yuri Pankov <yuripv@icloud.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 68ce5a03 07-Apr-2017 Patrick Kelsey <pkelsey@FreeBSD.org>

Fix typo in comment.

logest -> longest

MFC after: 3 days


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# 55dfce58 15-Nov-2016 Mark Johnston <markj@FreeBSD.org>

Plug a lock leak in sysctl_ifmalist().

Fix style in the local variable declarations.

PR: 214542
MFC after: 1 week


# ab607f28 12-Nov-2016 Ryan Stone <rstone@FreeBSD.org>

Don't read if_counters with if_addr_lock held

Calling into an ifnet implementation with the if_addr_lock already
held can cause a LOR and potentially a deadlock, as ifnet
implementations typically can take the if_addr_lock after their
own locks during configuration. Refactor a sysctl handler that
was violating this to read if_counter data in a temporary buffer
before the if_addr_lock is taken, and then copying the data
in its final location later, when the if_addr_lock is held.

PR: 194109
Reported by: Jean-Sebastien Pedron
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D8498
Reviewed by: sbruno


# 484149de 03-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Introduce a per-VNET flag to enable/disable netisr prcessing on that VNET.
Add accessor functions to toggle the state per VNET.
The base system (vnet0) will always enable itself with the normal
registration. We will share the registered protocol handlers in all
VNETs minimising duplication and management.
Upon disabling netisr processing for a VNET drain the netisr queue from
packets for that VNET.

Update netisr consumers to (de)register on a per-VNET start/teardown using
VNET_SYS(UN)INIT functionality.

The change should be transparent for non-VIMAGE kernels.

Reviewed by: gnn (, hiren)
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6691


# a4641f4e 03-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/net*: minor spelling fixes.

No functional change.


# 02abd400 19-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

kernel: use our nitems() macro when it is available through param.h.

No functional change, only trivial cases are done in this sweep,

Discussed in: freebsd-current


# 35030a5d 29-Mar-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove some NULL checks for M_WAITOK allocations.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 61eee0e2 24-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

MFP r287070,r287073: split radix implementation and route table structure.

There are number of radix consumers in kernel land (pf,ipfw,nfs,route)
with different requirements. In fact, first 3 don't have _any_ requirements
and first 2 does not use radix locking. On the other hand, routing
structure do have these requirements (rnh_gen, multipath, custom
to-be-added control plane functions, different locking).
Additionally, radix should not known anything about its consumers internals.

So, radix code now uses tiny 'struct radix_head' structure along with
internal 'struct radix_mask_head' instead of 'struct radix_node_head'.
Existing consumers still uses the same 'struct radix_node_head' with
slight modifications: they need to pass pointer to (embedded)
'struct radix_head' to all radix callbacks.

Routing code now uses new 'struct rib_head' with different locking macro:
RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing
information base).

New net/route_var.h header was added to hold routing subsystem internal
data. 'struct rib_head' was placed there. 'struct rtentry' will also
be moved there soon.


# 9a1b64d5 04-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add rib_lookup_info() to provide API for retrieving individual route
entries data in unified format.

There are control plane functions that require information other than
just next-hop data (e.g. individual rtentry fields like flags or
prefix/mask). Given that the goal is to avoid rte reference/refcounting,
re-use rt_addrinfo structure to store most rte fields. If caller wants
to retrieve key/mask or gateway (which are sockaddrs and are allocated
separately), it needs to provide sufficient-sized sockaddrs structures
w/ ther pointers saved in passed rt_addrinfo.

Convert:
* lltable new records checks (in_lltable_rtcheck(),
nd6_is_new_addr_neighbor().
* rtsock pre-add/change route check.
* IPv6 NS ND-proxy check (RADIX_MPATH code was eliminated because
1) we don't support RTF_ANNOUNCE ND-proxy for networks and there should
not be multiple host routes for such hosts 2) if we have multiple
routes we should inspect them (which is not done). 3) the entire idea
of abusing KRT as storage for ND proxy seems odd. Userland programs
should be used for that purpose).


# f7bab8d0 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Switch route radix to dual-lock model:
use rmlock for data patch access, and config rwlock
for conrol plane processing. Route table changes require
bock locks held.


# 69d149ad 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Since we no longer return individual radix entries, it is
not possible to do per-rte accounting. Remove rt_kpktsent.


# 033074c4 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Replace 'struct route *' if_output() argument with 'struct nhop_info *'.
Leave 'struct route' as is for legacy routing api users.
Remove most of rtalloc_ign*-derived functions.


# 55e5eda6 08-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Separate radix and routing: use different structures for route and
for other customers.

Introduce new 'struct rib_head' for routing purposes and make
all routing api use it.


# 8c3cfe0b 04-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Hide 'struct rtentry' and all its macro inside new header:
net/route_internal.h
The goal is to make its opaque for all code except route/rtsock and
proto domain _rmx.


# 56b61ca2 19-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove ifq_drops from struct ifqueue. Now queue drops are accounted in
struct ifnet if_oqdrops.

Some netgraph modules used ifqueue w/o ifnet. Accounting of queue drops
is simply removed from them. There were no API to read this statistic.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 4f8585e0 11-Sep-2014 Alan Somers <asomers@FreeBSD.org>

Revisions 264905 and 266860 added a "int fib" argument to ifa_ifwithnet and
ifa_ifwithdstaddr. For the sake of backwards compatibility, the new
arguments were added to new functions named ifa_ifwithnet_fib and
ifa_ifwithdstaddr_fib, while the old functions became wrappers around the
new ones that passed RT_ALL_FIBS for the fib argument. However, the
backwards compatibility is not desired for FreeBSD 11, because there are
numerous other incompatible changes to the ifnet(9) API. We therefore
decided to remove it from head but leave it in place for stable/9 and
stable/10. In addition, this commit adds the fib argument to
ifa_ifwithbroadaddr for consistency's sake.

sys/sys/param.h
Increment __FreeBSD_version

sys/net/if.c
sys/net/if_var.h
sys/net/route.c
Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib
versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute.

sys/net/route.c
sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_options.c
sys/netinet/ip_output.c
sys/netinet6/nd6.c
Fixup calls of modified functions.

share/man/man9/ifnet.9
Document changed API.

CR: https://reviews.freebsd.org/D458
MFC after: Never
Sponsored by: Spectra Logic


# e6485f73 31-Aug-2014 Gleb Smirnoff <glebius@FreeBSD.org>

o Remove struct if_data from struct ifnet. Now it is merely API structure
for route(4) socket and ifmib(4) sysctl.
o Move fields from if_data to ifnet, but keep all statistic counters
separate, since they should disappear later.
o Provide function if_data_copy() to fill if_data, utilize it in routing
socket and ifmib handler.
o Provide overridable ifnet(9) method to fetch counters. If no provided,
if_get_counters_compat() would be used, that returns old counters.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 73d76e77 14-Aug-2014 Kevin Lo <kevlo@FreeBSD.org>

Change pr_output's prototype to avoid the need for explicit casts.
This is a follow up to r269699.

Phabric: D564
Reviewed by: jhb


# 9753faf5 29-Jul-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Garbage collect couple of unused fields from struct ifaddr:
- ifa_claim_addr() unused since removal of NetAtalk
- ifa_metric seems to be never utilized, always a copy of if_metric


# 2f308a34 29-May-2014 Alan Somers <asomers@FreeBSD.org>

Fix unintended KBI change from r264905. Add _fib versions of
ifa_ifwithnet() and ifa_ifwithdstaddr() The legacy functions will call the
_fib() versions with RT_ALL_FIBS, preserving legacy behavior.

sys/net/if_var.h
sys/net/if.c
Add legacy-compatible functions as described above. Ensure legacy
behavior when RT_ALL_FIBS is passed as fibnum.

sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/net/route.c
sys/net/rtsock.c
sys/netinet6/nd6.c
Call with _fib() functions if we must use a specific fib, or the
legacy functions otherwise.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
Improve the udp_dontroute test. The bug that this test exercises is
that ifa_ifwithnet() will return the wrong address, if multiple
interfaces have addresses on the same subnet but with different
fibs. The previous version of the test only considered one possible
failure mode: that ifa_ifwithnet_fib() might fail to find any
suitable address at all. The new version also checks whether
ifa_ifwithnet_fib() finds the correct address by checking where the
ARP request goes.

Reported by: bz, hrs
Reviewed by: hrs
MFC after: 1 week
X-MFC-with: 264905
Sponsored by: Spectra Logic


# 6db47af4 08-May-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Rename rt_msg1() to more handy rtsock_msg_mbuf().
(Just for history purposes: rt_msg2() was renamed
to rtsock_msg_buffer() in r265019).

Sponsored by: Yandex LLC
MFC after: 1 month


# 3deb3649 08-May-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix incorrect netmasks being passed via rtsock.

Since radix has been ignoring sa_family in passed sockaddrs,
no one ever has bothered filling valid sa_family in netmasks.
Additionally, radix adjusts sa_len field in every netmask not to
compare zero bytes at all.

This leads us to rt_mask with sa_family of AF_UNSPEC (-1) and
arbitrary sa_len field (0 for default route, for example).

However, rtsock have been passing that rt_mask intact for ages,
requiring all rtsock consumers to make ther own local hacks.
We even have unfixed on in base:

do `route -n monitor` in one window and issue `route -n get addr`
for some directly-connected address. You will probably see the following:

got message of size 304 on Thu May 8 15:06:06 2014
RTM_GET: Report Metrics: len 304, pid: 30493, seq 1, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK,IFP,IFA>
10.0.0.0 link#1 (255) ffff ffff ff em0:8.0.27.c5.29.d4 10.0.0.92
_________________^^^^^^^^^^^^^^^^^^

after the change:

got message of size 312 on Thu May 8 15:44:07 2014
RTM_GET: Report Metrics: len 312, pid: 2895, seq 1, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK,IFP,IFA>
10.0.0.0 link#1 255.255.255.0 em0:8.0.27.c5.29.d4 10.0.0.92
_________________^^^^^^^^^^^^^^^^^^

Sponsored by: Yandex LLC
MFC after: 1 month


# c9f98940 03-May-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix sysctl_ifmalist() broken in r265019.

Reported by: Olivier Cochard-Labbé
MFC with: r265019


# d9437c0f 29-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Partially revert r265019 - allocating 512 bytes on stack
can be too much for architectures like ARM. Always use rounded
malloc instead.

Discussed with: jmallett
MFC after: 4 weeks


# 0fb9298d 29-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Move rt_setmetrics() from rtsock.c to route.c.
All rtsock-initiated rte creation/modification are now
performed in route.c holding radix tree write lock.
This reduces the need for per-rte mutex.

Sponsored by: Yandex LLC
MFC after: 1 month


# de46b2c6 27-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix build

Found by: ian
Pointyhat to: me


# f2e5eb36 27-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Improve memory allocation model for rt_msg2() rtsock messages:
* memory is now allocated as early as possible, without holding locks.
* sysctl users are now guaranteed to get a response (M_WAITOK buffer prealloc).
* socket users are more likely to use on-stack buffer for replies.
* standard kernel malloc/free functions are now used instead of radix wrappers.
rt_msg2() has been renamed to rtsock_msg_buffer().

MFC after: 1 month


# f1fcb552 27-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove useless zeroing of RTAX_DST on error.
Cleanup a bit.

MFC after: 1 month


# 92c227af 27-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Cleanup route_output() a bit.

MFC after: 1 month


# 2277c5e5 27-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Do not delay freeing rtm. Bandaid added in r227061 is not needed since r227061,

MFC after: 1 month


# f5d9a696 26-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Move up fibnum to ensure it is always defined.

Found by: ian
MFC with: r264987


# 773aa053 26-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Determine fibnum once in the beginning of route_output().

MFC after: 1 month


# c77462dd 26-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Decouple RTM_CHANGE from RTM_GET handling in rtsock.c:route_output().
RTM_CHANGE is now handled inside route.c:rtrequest1_fib() as it should be.
Note change change handler is a separate function rtrequest1_fib_change().

MFC after: 1 month


# 36d55f0f 26-Apr-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Unify sa_equal() macro usage.

MFC after: 2 weeks


# 0cfee0c2 24-Apr-2014 Alan Somers <asomers@FreeBSD.org>

Fix subnet and default routes on different FIBs on the same subnet.

These two bugs are closely related. The root cause is that ifa_ifwithnet
does not consider FIBs when searching for an interface address.

sys/net/if_var.h
sys/net/if.c
Add a fib argument to ifa_ifwithnet and ifa_ifwithdstadddr. Those
functions will only return an address whose interface fib equals the
argument.

sys/net/route.c
Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib
arguments.

sys/netinet/in.c
Update in_addprefix to consider the interface fib when adding
prefixes. This will prevent it from not adding a subnet route when
one already exists on a different fib.

sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/netinet6/nd6.c
Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet.
In some cases it there wasn't a clear specific fib number to use.
In others, I was unable to test those functions so I chose
RT_DEFAULT_FIB to minimize divergence from current behavior. I will
fix some of the latter changes along with PR kern/187553.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
tests/sys/netinet/Makefile
Revert r263738. The udp_dontroute test was right all along.
However, bugs kern/187550 and kern/187553 cancelled each other out
when it came to this test. Because of kern/187553, ifa_ifwithnet
searched the default fib instead of the requested one, but because
of kern/187550, there was an applicable subnet route on the default
fib. The new test added in r263738 doesn't work right, however. I
can verify with dtrace that ifa_ifwithnet returned the wrong address
before I applied this commit, but route(8) miraculously found the
correct interface to use anyway. I don't know how.

Clear expected failure messages for kern/187550 and kern/187552.

PR: kern/187550
PR: kern/187552
Reviewed by: melifaro
MFC after: 3 weeks
Sponsored by: Spectra Logic


# 2d70c0de 19-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

When exporting ifnet via sysctl, add ifqueue(9) drop count to the
ifi_oqdrops. This is a temporary workaround until ifqueue(9) vanishes.

While here, remove the pointless ifi_vhid assignment. It has
sense only when we are exporting ifaddrs, not ifnets.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 2c284d93 13-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove IPX support.

IPX was a network transport protocol in Novell's NetWare network operating
system from late 80s and then 90s. The NetWare itself switched to TCP/IP
as default transport in 1998. Later, in this century the Novell Open
Enterprise Server became successor of Novell NetWare. The last release
that claimed to still support IPX was OES 2 in 2007. Routing equipment
vendors (e.g. Cisco) discontinued support for IPX in 2011.

Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.


# b245f96c 12-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Since 32-bit if_baudrate isn't enough to describe a baud rate of a 10 Gbit
interface, in the r241616 a crutch was provided. It didn't work well, and
finally we decided that it is time to break ABI and simply make if_baudrate
a 64-bit value. Meanwhile, the entire struct if_data was reviewed.

o Remove the if_baudrate_pf crutch.

o Make all fields of struct if_data fixed machine independent size. The
notion of data (packet counters, etc) are by no means MD. And it is a
bug that on amd64 we've got a 64-bit counters, while on i386 32-bit,
which at modern speeds overflow within a second.

This also removes quite a lot of COMPAT_FREEBSD32 code.

o Give 16 bit for the ifi_datalen field. This field was provided to
make future changes to if_data less ABI breaking. Unfortunately the
8 bit size of it had effectively limited sizeof if_data to 256 bytes.

o Give 32 bits to ifi_mtu and ifi_metric.
o Give 64 bits to the rest of fields, since they are counters.

__FreeBSD_version bumped.

Discussed with: emax
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# e3a7aa6f 04-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with: melifaro
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# c5d4eab6 19-Feb-2014 Marko Zec <zec@FreeBSD.org>

V_irtualize rtsock refcounting, which reduces the chances for panics
on teardown of vnets without active routing sockets while at least
one routing socket is active elsewhere.

Tested by: Vijay Singh
MFC after: 3 days


# d375edc9 09-Jan-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify inet alias handling code: if we're adding/removing alias which
has the same prefix as some other alias on the same interface, use
newly-added rt_addrmsg() instead of hand-rolled in_addralias_rtmsg().

This eliminates the following rtsock messages:

Pinned RTM_ADD for prefix (for alias addition).
Pinned RTM_DELETE for prefix (for alias withdrawal).

Example (got 10.0.0.1/24 on vlan4, playing with 10.0.0.2/24):

before commit, addition:

got message of size 116 on Fri Jan 10 14:13:15 2014
RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

got message of size 192 on Fri Jan 10 14:13:15 2014
RTM_ADD: Add Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 10.0.0.2 (255) ffff ffff ff

after commit, addition:

got message of size 116 on Fri Jan 10 13:56:26 2014
RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 14.0.0.2 14.0.0.255

before commit, wihdrawal:

got message of size 192 on Fri Jan 10 13:58:59 2014
RTM_DELETE: Delete Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 10.0.0.2 (255) ffff ffff ff

got message of size 116 on Fri Jan 10 13:58:59 2014
RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

adter commit, withdrawal:

got message of size 116 on Fri Jan 10 14:14:11 2014
RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

Sending both RTM_ADD/RTM_DELETE messages to rtsock is completely wrong
(and requires some hacks to keep prefix in route table on RTM_DELETE).

I've tested this change with quagga (no change) and bird (*).

bird alias handling is already broken in *BSD sysdep code, so nothing
changes here, too.

I'm going to MFC this change if there will be no complains about behavior
change.

While here, fix some style(9) bugs introduced by r260488
(pointed by glebius and bde).

Sponsored by: Yandex LLC
MFC after: 4 weeks


# 4cbac30b 09-Jan-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Split rt_newaddrmsg_fib() into two different functions.
Adding/deleting interface addresses involves access to 3 different subsystems,
int different parts of code. Each call can fail, so reporting successful
operation by rtsock in the middle of the process error-prone.

Further split routing notification API and actual rtsock calls via creating
public-available rt_addrmsg() / rt_routemsg() functions with "private"
rtsock_* backend.

MFC after: 2 weeks


# 7d9b6df1 08-Jan-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Constanly use RT_ALL_FIBS everywhere instead of -1.

MFC after: 2 weeks


# 5a2f4cbd 04-Jan-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Change semantics for rnh_lookup() function: now
it performs exact match search, regardless of netmask existance.
This simplifies most of rnh_lookup() consumers.

Fix panic triggered by deleting non-existent host route.

PR: kern/185092
Submitted by: Nikolay Denev <ndenev at gmail.com>
MFC after: 1 month


# 76039bc8 26-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 7caf4ab7 15-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Utilize counter(9) to accumulate statistics on interface addresses. Add
four counters to struct ifaddr. This kills '+=' on a variables shared
between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
took if_data not from the ifaddr, but from ifaddr's ifnet. [2]

Submitted by: melifaro [1], pluknet[2]
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 413e45bf 20-Aug-2013 Bjoern A. Zeeb <bz@FreeBSD.org>

After r241616 properly export ifi_baudrate_pf in the 32bit compat case.

MFC after: 3 days


# 12bdf23a 28-Jul-2013 Hiroki Sato <hrs@FreeBSD.org>

sin6 should be assigned before the loop.


# 4825b1e0 11-Jul-2013 Hiroki Sato <hrs@FreeBSD.org>

Add a leaf node CTL_NET.PF_ROUTE.0.AF.NET_RT_DUMP.0.FIB. This returns
routing table with the specified FIB number, not td->td_proc->p_fibnum.


# f672f56f 24-Jun-2013 Qing Li <qingli@FreeBSD.org>

Due to the routing related networking kernel redesign work
in FBSD 8.0, interface routes have been returened to the
applications without the RTF_GATEWAY bit. This incompatibility
has caused some issues with Zebra, Qugga and the like.
This patch provides the RTF_GATEWAY flag bit in returned interface
routes so to behave similarly to pre 8.0 systems.

Reviewed by: hrs
Verified by: mackn at opendns dot com


# c69f77c3 14-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Use m_getcl() instead of hand allocating.
- Convert panic() to KASSERT.
- Remove superfluous cleaning of mbuf fields after allocation.
- Add comment on possible use of m_get2() here.

Sponsored by: Nginx, Inc.


# 0bebb544 05-Dec-2012 Hiroki Sato <hrs@FreeBSD.org>

- Move definition of V_deembed_scopeid to scope6_var.h.
- Deembed scope id in L3 address in in6_lltable_dump().
- Simplify scope id recovery in rtsock routines.
- Remove embedded scope id handling in ndp(8) and route(8) completely.


# eb1b1807 05-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# 5c9fa630 04-Dec-2012 Hiroki Sato <hrs@FreeBSD.org>

- Fix LOR in sa6_recoverscope() in rt_msg2()[1].
- Check V_deembed_scopeid before checking if sa_family == AF_INET6.
- Fix scope id handing in route(8)[2] and ifconfig(8).

Reported by: rpaulo[1], Mateusz Guzik[1], peter[2]


# d60ec817 17-Nov-2012 Adrian Chadd <adrian@FreeBSD.org>

Fix up a compile time warning if INET6 isn't defined.


# 6bbfef90 17-Nov-2012 Hiroki Sato <hrs@FreeBSD.org>

Fill sin6_scope_id in sockaddr_in6 before passing it from the kernel to
userland via routing socket or sysctl. This eliminates the following
KAME-specific sin6_scope_id handling routine from each userland utility:

sin6.sin6_scope_id = ntohs(*(u_int16_t *)&sin6.sin6_addr.s6_addr[2]);

This behavior can be controlled by net.inet6.ip6.deembed_scopeid. This is
set to 1 by default (sin6_scope_id will be filled in the kernel).

Reviewed by: bz


# c9b652e3 18-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Mechanically remove the last stray remains of spl* calls from net*/*.
They have been Noop's for a long time now.


# c2508034 22-Apr-2012 Alexander V. Chernikov <melifaro@FreeBSD.org>

Do not require radix write lock to be held while dumping route table
via sysctl(4) interface. This permits router not to stop forwarding
packets while route table is being written to user-supplied buffer.

Reported by: Pawel Tyll <ptyll@nitronet.pl>
Approved by: kib(mentor)

MFC after: 1 week


# 6d076ae8 10-Feb-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Introduce a new NET_RT_IFLISTL API to query the address list. It works
on extended and extensible structs if_msghdrl and ifa_msghdrl. This
will allow us to extend both the msghdrl structs and eventually if_data
in the future without breaking the ABI.

Bump __FreeBSD_version to allow ports to more easily detect the new API.

Reviewed by: glebius, brooks
MFC after: 3 days


# e82cf13b 10-Feb-2012 Bjoern A. Zeeb <bz@FreeBSD.org>

Backout changes from r228571. Remove if_data from struct ifa_msghdr again.
While this breaks carp on HEAD temporary, it restores the upgrade path from
stable, and head before 20111215.

Reviewed by: glebius, brooks


# c82c34b4 08-Jan-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Copy ifa->if_data to ifam->ifam_data. This was forgotten in r228571.

Submitted by: bz


# 137f91e8 05-Jan-2012 John Baldwin <jhb@FreeBSD.org>

Convert all users of IF_ADDR_LOCK to use new locking macros that specify
either a read lock or write lock.

Reviewed by: bz
MFC after: 2 weeks


# 08b68b0e 15-Dec-2011 Gleb Smirnoff <glebius@FreeBSD.org>

A major overhaul of the CARP implementation. The ip_carp.c was started
from scratch, copying needed functionality from the old implemenation
on demand, with a thorough review of all code. The main change is that
interface layer has been removed from the CARP. Now redundant addresses
are configured exactly on the interfaces, they run on.

The CARP configuration itself is, as before, configured and read via
SIOCSVH/SIOCGVH ioctls. A new prefix created with SIOCAIFADDR or
SIOCAIFADDR_IN6 may now be configured to a particular virtual host id,
which makes the prefix redundant.

ifconfig(8) semantics has been changed too: now one doesn't need
to clone carpXX interface, he/she should directly configure a vhid
on a Ethernet interface.

To supply vhid data from the kernel to an application the getifaddrs(8)
function had been changed to pass ifam_data with each address. [1]

The new implementation definitely closes all PRs related to carp(4)
being an interface, and may close several others. It also allows
to run a single redundant IP per interface.

Big thanks to Bjoern Zeeb for his help with inet6 part of patch, for
idea on using ifam_data and for several rounds of reviewing!

PR: kern/117000, kern/126945, kern/126714, kern/120130, kern/117448
Reviewed by: bz
Submitted by: bz [1]


# 6472ac3d 07-Nov-2011 Ed Schouten <ed@FreeBSD.org>

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# 3ca1a2d6 03-Nov-2011 Max Laier <mlaier@FreeBSD.org>

Fix a use-after-free/redzone issue in the routing code.

Reported by (repeatedly): Mike Tancsa
Prodded by (repeatedly): bz
Forgotten by (repeatedly): mlaier
MFC after: 2 weeks


# 528737fd 28-Sep-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Pass the fibnum where we need filtering of the message on the
rtsock allowing routing daemons to filter routing updates on an
rtsock per FIB.

Adjust raw_input() and split it into wrapper and a new function
taking an optional callback argument even though we only have one
consumer [1] to keep the hackish flags local to rtsock.c.

PR: kern/134931
Submitted by: multiple (see PR)
Suggested by: rwatson [1]
Reviewed by: rwatson
MFC after: 3 days


# 826bf287 09-Feb-2011 Max Laier <mlaier@FreeBSD.org>

As info.rti_info[RTAX_DST] can point inside of rtm we must not free the rtm
until rt_dispatch is done with the sockaddr.

Found by: memguard
MFC after: 3 days


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# ee7c7fee 16-Oct-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

Close a race acquiring the IF_ADDR_LOCK() for each entry while iterating
over all interfaces to make sure the address will neither change nor be
freed while we are working on it.

PR: kern/146250
Submitted by: Mikolaj Golub (to.my.trociny gmail.com)
MFC after: 1 week


# 04f32057 03-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

Properly set ifi_datalen for compat32 struct if_data32.

PR: kern/149240
Submitted by: Stef Walter <stef memberwebs com>
MFC after: 1 weeks


# dd62f5c0 25-Jun-2010 Qing Li <qingli@FreeBSD.org>

MFC r208553

This patch fixes the problem where proxy ARP entries cannot be added
over the if_ng interface.

Approved by: re (bz)


# 0ed6142b 25-May-2010 Qing Li <qingli@FreeBSD.org>

This patch fixes the problem where proxy ARP entries cannot be added
over the if_ng interface.

MFC after: 3 days


# b0af8356 08-May-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r207194:
Provide 32bit compat shims for sysctl net.route NET_RT_IFLIST.


# 427a928a 25-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

Provide 32bit compat shims for sysctl net.route NET_RT_IFLIST.
This allows getifaddrs(3) to work for compat32 binaries.

Submitted by: jhb (6.x version)
Reviewed by: emaste
Tested by: emaste and <pluknet gmail com>
MFC after: 2 weeks


# 32c53401 05-Jan-2010 Qing Li <qingli@FreeBSD.org>

MFC r201282, r201543

r201282
-------
The proxy arp entries could not be added into the system over the
IFF_POINTOPOINT link types. The reason was due to the routing
entry returned from the kernel covering the remote end is of an
interface type that does not support ARP. This patch fixes this
problem by providing a hint to the kernel routing code, which
indicates the prefix route instead of the PPP host route should
be returned to the caller. Since a host route to the local end
point is also added into the routing table, and there could be
multiple such instantiations due to multiple PPP links can be
created with the same local end IP address, this patch also fixes
the loopback route installation failure problem observed prior to
this patch. The reference count of loopback route to local end would
be either incremented or decremented. The first instantiation would
create the entry and the last removal would delete the route entry.

r201543
-------
The IFA_RTSELF address flag marks a loopback route has been installed
for the interface address. This marker is necessary to properly support
PPP types of links where multiple links can have the same local end
IP address. The IFA_RTSELF flag bit maps to the RTF_HOST value, which
was combined into the route flag bits during prefix installation in
IPv6. This inclusion causing the prefix route to be unusable. This
patch fixes this bug by excluding the IFA_RTSELF flag during route
installation.

PR: ports/141342, kern/141134


# c7ab6602 30-Dec-2009 Qing Li <qingli@FreeBSD.org>

The proxy arp entries could not be added into the system over the
IFF_POINTOPOINT link types. The reason was due to the routing
entry returned from the kernel covering the remote end is of an
interface type that does not support ARP. This patch fixes this
problem by providing a hint to the kernel routing code, which
indicates the prefix route instead of the PPP host route should
be returned to the caller. Since a host route to the local end
point is also added into the routing table, and there could be
multiple such instantiations due to multiple PPP links can be
created with the same local end IP address, this patch also fixes
the loopback route installation failure problem observed prior to
this patch. The reference count of loopback route to local end would
be either incremented or decremented. The first instantiation would
create the entry and the last removal would delete the route entry.

MFC after: 5 days


# 950cde50 28-Dec-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r200473:

Throughout the network stack we have a few places of
if (jailed(cred))
left. If you are running with a vnet (virtual network stack) those will
return true and defer you to classic IP-jails handling and thus things
will be "denied" or returned with an error.

Work around this problem by introducing another "jailed()" function,
jailed_without_vnet(), that also takes vnets into account, and permits
the calls, should the jail from the given cred have its own virtual
network stack.

We cannot change the classic jailed() call to do that, as it is used
outside the network stack as well.

Discussed with: julian, zec, jamie, rwatson (back in Sept)


# de0bd6f7 13-Dec-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Throughout the network stack we have a few places of
if (jailed(cred))
left. If you are running with a vnet (virtual network stack) those will
return true and defer you to classic IP-jails handling and thus things
will be "denied" or returned with an error.

Work around this problem by introducing another "jailed()" function,
jailed_without_vnet(), that also takes vnets into account, and permits
the calls, should the jail from the given cred have its own virtual
network stack.

We cannot change the classic jailed() call to do that, as it is used
outside the network stack as well.

Discussed with: julian, zec, jamie, rwatson (back in Sept)
MFC after: 5 days


# c7276c59 30-Aug-2009 Qing Li <qingli@FreeBSD.org>

As part of r196609, a call to "rtalloc" did not take the fib into
account. So call the appropriate "rtalloc_ign_fib()" instead of
calling "rtalloc_ign()".

Reviewed by: pointed out by bz
Approved by: re


# 5311e988 30-Aug-2009 Qing Li <qingli@FreeBSD.org>

As part of r196609, a call to "rtalloc" did not take the fib into
account. So call the appropriate "rtalloc_ign_fib()" instead of
calling "rtalloc_ign()".

Reviewed by:i pointed out by bz
MFC after: immediately


# ba3ae75b 30-Aug-2009 Qing Li <qingli@FreeBSD.org>

MFC r196609

In ip_output(), the flow-table module must not try to cache L2/L3
information for interface of IFF_POINTOPOINT or IFF_LOOPBACK type.
Since the L2 information (rt_lle) is invalid for these interface
types, accidental caching attempt will trigger panic when the invalid
rt_lle reference is accessed.

When installing a new route, or when updating an existing route, the
user supplied gateway address may be an interface address (this is
particularly true for point-to-point interface related modules such
as ppp, if_tun, if_gif). Currently the routing command handler always
set the RTF_GATEWAY flag if the gateway address is given as part of the
command paramters. Therefore the gateway address must be verified against
interface addresses or else the route would be treated as an indirect
route, thus making that route unusable.

Reviewed by: kmacy, julian, rwatson
Approved by: re


# 9231d35f 28-Aug-2009 Qing Li <qingli@FreeBSD.org>

In ip_output(), the flow-table module must not try to cache L2/L3
information for interface of IFF_POINTOPOINT or IFF_LOOPBACK type.
Since the L2 information (rt_lle) is invalid for these interface
types, accidental caching attempt will trigger panic when the invalid
rt_lle reference is accessed.

When installing a new route, or when updating an existing route, the
user supplied gateway address may be an interface address (this is
particularly true for point-to-point interface related modules such
as ppp, if_tun, if_gif). Currently the routing command handler always
set the RTF_GATEWAY flag if the gateway address is given as part of the
command paramters. Therefore the gateway address must be verified against
interface addresses or else the route would be treated as an indirect
route, thus making that route unusable.

Reviewed by: kmacy, julia, rwatson
Verified by: marcus
MFC after: 3 days


# 9b36fbe9 13-Aug-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r196174:

Put multiple instructions into a block when iterating; unbreaks
NET_RT_DUMP, which otherwise only returned information of AF_MAX.
This was broken in r193232 (save your time - my bug, my fix).

Reported by: Larry Baird (lab gta.com)
Tested by: Larry Baird (lab gta.com)
Reviewed by: zec, lstewart, qing

PR: kern/137700
Approved by: re (kib)


# 20b0cdb7 13-Aug-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Put multiple instructions into a block when iterating; unbreaks
NET_RT_DUMP, which otherwise only returned information of AF_MAX.
This was broken in r193232 (save your time - my bug, my fix).

PR: kern/137700
Reported by: Larry Baird (lab gta.com)
Tested by: Larry Baird (lab gta.com)
Reviewed by: zec, lstewart, qing
Approved by: re (kib)


# 530c0060 01-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# d0728d71 23-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Introduce and use a sysinit-based initialization scheme for virtual
network stacks, VNET_SYSINIT:

- Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will
occur each time a network stack is instantiated and destroyed. In the
!VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT.
For the VIMAGE case, we instead use SYSINIT's to track their order and
properties on registration, using them for each vnet when created/
destroyed, or immediately on module load for already-started vnets.
- Remove vnet_modinfo mechanism that existed to serve this purpose
previously, as well as its dependency scheme: we now just use the
SYSINIT ordering scheme.
- Implement VNET_DOMAIN_SET() to allow protocol domains to declare that
they want init functions to be called for each virtual network stack
rather than just once at boot, compiling down to DOMAIN_SET() in the
non-VIMAGE case.
- Walk all virtualized kernel subsystems and make use of these instead
of modinfo or DOMAIN_SET() for init/uninit events. In some cases,
convert modular components from using modevent to using sysinit (where
appropriate). In some cases, do minor rejuggling of SYSINIT ordering
to make room for or better manage events.

Portions submitted by: jhb (VNET_SYSINIT), bz (cleanup)
Discussed with: jhb, bz, julian, zec
Reviewed by: bz
Approved by: re (VIMAGE blanket)


# eddfbb76 14-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 8c0fec80 23-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references. The following routines now return references:

ifaddr_byindex
ifa_ifwithaddr
ifa_ifwithbroadaddr
ifa_ifwithdstaddr
ifa_ifwithnet
ifaof_ifpforaddr
ifa_ifwithroute
ifa_ifwithroute_fib
rt_getifa
rt_getifa_fib
IFP_TO_IA
ip_rtaddr
in6_ifawithifp
in6ifa_ifpforlinklocal
in6ifa_ifpwithaddr
in6_ifadd
carp_iamatch6
ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc). In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit. Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by: bz
Obtained from: Apple, Inc. (portions)
MFC after: 6 weeks (portions)


# 1099f828 21-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Clean up common ifaddr management:

- Unify reference count and lock initialization in a single function,
ifa_init().
- Move tear-down from a macro (IFAFREE) to a function ifa_free().
- Move reference count bump from a macro (IFAREF) to a function ifa_ref().
- Instead of using a u_int protected by a mutex to refcount(9) for
reference count management.

The ifa_mtx is now used for exactly one ioctl, and possibly should be
removed.

MFC after: 3 weeks


# c03528b6 10-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

SCTP needs either IPv4 or IPv6 as lower layer[1].
So properly hide the already #ifdef SCTP code with
#if defined(INET) || defined(INET6) as well to get us
closer to a non-INET/INET6 kernel.

Discussed with: tuexen [1]


# 8d8bc018 08-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

After r193232 rt_tables in vnet.h are no longer indirectly dependent on
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.

Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.


# c2c2a7c1 01-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Convert the two dimensional array to be malloced and introduce
an accessor function to get the correct rnh pointer back.

Update netstat to get the correct pointer using kvm_read()
as well.

This not only fixes the ABI problem depending on the kernel
option but also permits the tunable to overwrite the kernel
option at boot time up to MAXFIBS, enlarging the number of
FIBs without having to recompile. So people could just use
GENERIC now.

Reviewed by: julian, rwatson, zec
X-MFC: not possible


# d4b5cae4 01-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Reimplement the netisr framework in order to support parallel netisr
threads:

- Support up to one netisr thread per CPU, each processings its own
workstream, or set of per-protocol queues. Threads may be bound
to specific CPUs, or allowed to migrate, based on a global policy.

In the future it would be desirable to support topology-centric
policies, such as "one netisr per package".

- Allow each protocol to advertise an ordering policy, which can
currently be one of:

NETISR_POLICY_SOURCE: packets must maintain ordering with respect to
an implicit or explicit source (such as an interface or socket).

NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work,
as well as allowing protocols to provide a flow generation function
for mbufs without flow identifers (m2flow). Falls back on
NETISR_POLICY_SOURCE if now flow ID is available.

NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for
each packet handled by netisr (m2cpuid).

- Provide utility functions for querying the number of workstreams
being used, as well as a mapping function from workstream to CPU ID,
which protocols may use in work placement decisions.

- Add explicit interfaces to get and set per-protocol queue limits, and
get and clear drop counters, which query data or apply changes across
all workstreams.

- Add a more extensible netisr registration interface, in which
protocols declare 'struct netisr_handler' structures for each
registered NETISR_ type. These include name, handler function,
optional mbuf to flow ID function, optional mbuf to CPU ID function,
queue limit, and ordering policy. Padding is present to allow these
to be expanded in the future. If no queue limit is declared, then
a default is used.

- Queue limits are now per-workstream, and raised from the previous
IFQ_MAXLEN default of 50 to 256.

- All protocols are updated to use the new registration interface, and
with the exception of netnatm, default queue limits. Most protocols
register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use
NETISR_POLICY_FLOW, and will therefore take advantage of driver-
generated flow IDs if present.

- Formalize a non-packet based interface between interface polling and
the netisr, rather than having polling pretend to be two protocols.
Provide two explicit hooks in the netisr worker for start and end
events for runs: netisr_poll() and netisr_pollmore(), as well as a
function, netisr_sched_poll(), to allow the polling code to schedule
netisr execution. DEVICE_POLLING still embeds single-netisr
assumptions in its implementation, so for now if it is compiled into
the kernel, a single and un-bound netisr thread is enforced
regardless of tunable configuration.

In the default configuration, the new netisr implementation maintains
the same basic assumptions as the previous implementation: a single,
un-bound worker thread processes all deferred work, and direct dispatch
is enabled by default wherever possible.

Performance measurement shows a marginal performance improvement over
the old implementation due to the use of batched dequeue.

An rmlock is used to synchronize use and registration/unregistration
using the framework; currently, synchronized use is disabled
(replicating current netisr policy) due to a measurable 3%-6% hit in
ping-pong micro-benchmarking. It will be enabled once further rmlock
optimization has taken place. However, in practice, netisrs are
rarely registered or unregistered at runtime.

A new man page for netisr will follow, but since one doesn't currently
exist, it hasn't been updated.

This change is not appropriate for MFC, although the polling shutdown
handler should be merged to 7-STABLE.

Bump __FreeBSD_version.

Reviewed by: bz


# 0304c731 27-May-2009 Jamie Gritton <jamie@FreeBSD.org>

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# 21ca7b57 05-May-2009 Marko Zec <zec@FreeBSD.org>

Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by: julian (mentor)


# 093f25f8 26-Apr-2009 Marko Zec <zec@FreeBSD.org>

In preparation for turning on options VIMAGE in next commits,
rearrange / replace / adjust several INIT_VNET_* initializer
macros, all of which currently resolve to whitespace.

Reviewed by: bz (an older version of the patch)
Approved by: julian (mentor)


# 77ee4c0e 20-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Acquire address list lock before walking an interface's address list to
identify possible jail addresses on it for IPv4 and IPv6.

MFC after: 2 weeks


# 427ac07f 14-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

Extend route command:
- add show as alias for get
- add weights to allow mpath to do more than equal cost
- add sticky / nostick to disable / re-enable per-connection load balancing

This adds a field to rt_metrics_lite so network bits of world will need to be re-built.

Reviewed by: jeli & qingli


# 9c79d243 05-Feb-2009 Jamie Gritton <jamie@FreeBSD.org>

Call prison_if from rtm_get_jailed, instead of splitting it out into
prison_check_ip4 and prison_check_ip6. As prison_if includes a jailed()
check, remove that check before calling rtm_get_jailed.

Approved by: bz (mentor)


# b89e82dd 05-Feb-2009 Jamie Gritton <jamie@FreeBSD.org>

Standardize the various prison_foo_ip[46] functions and prison_if to
return zero on success and an error code otherwise. The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family. For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.

Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes). Remove the explicit
jailed() checks that preceded many of the function calls.

Approved by: bz (mentor)


# 1cecba0f 25-Jan-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

For consistency with prison_{local,remote,check}_ipN rename
prison_getipN to prison_get_ipN.

Submitted by: jamie (as part of a larger patch)
MFC after: 1 week


# 82b334e8 16-Jan-2009 Qing Li <qingli@FreeBSD.org>

The RTF_LLINFO was revived unconditionally, but within the kernel the
check on the sysctl argument value being RTF_LLINFO is conditioned on
the COMPAT_ROUTE_FLAGS kernel option. This mismatch caused the L2
table retrieval failure, and the arp/ndp -an command displays empty L2
tables.

Reviewed by: pjd


# 14981d80 12-Jan-2009 Qing Li <qingli@FreeBSD.org>

Revive the RTF_LLINFO flag in route.h. The kernel code is guarded
by the new kernel option COMPAT_ROUTE_FLAGS for binary backward
compatibility. The RTF_LLDATA flag maps to the same value as RTF_LLINFO.
RTF_LLDATA is used by the arp and ndp utilities. The RTF_LLDATA flag is
always returned to the userland regardless whether the COMPAT_ROUTE_FLAGS
is defined.


# c2ded8ae 09-Jan-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than using the cred from curthread, take it from the thread
referenced in the sysctl req argument.

Reviewed by: rwatson
MFC after: 2 weeks


# 813dd6ae 09-Jan-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Restrict arp, ndp and theoretically the FIB listing (if not
read with libkvm) to the addresses of a prison, when inside a
jail. [1]
As the patch from the PR was pre-'new-arp', add checks to the
llt_dump handlers as well.

While touching RTM_GET in route_output(), consistently use
curthread credentials rather than the creds from the socket
there. [2]

PR: kern/68189
Submitted by: Mark Delany <sxcg2-fuwxj@qmda.emu.st> [1]
Discussed with: rwatson [2]
Reviewed by: rwatson
MFC after: 4 weeks


# ebda3fc3 09-Jan-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Take the cred from curthread rather than curproc as curproc would need
locking but the credential from curthread (usually) never changes.

Discussed with: jhb
MFC after: 2 weeks


# 8eca593c 26-Dec-2008 Qing Li <qingli@FreeBSD.org>

This checkin addresses a couple of issues:
1. The "route" command allows route insertion through the interface-direct
option "-iface". During if_attach(), an sockaddr_dl{} entry is created
for the interface and is part of the interface address list. This
sockaddr_dl{} entry describes the interface in detail. The "route"
command selects this entry as the "gateway" object when the "-iface"
option is present. The "arp" and "ndp" commands also interact with the
kernel through the routing socket when adding and removing static L2
entries. The static L2 information is also provided through the
"gateway" object with an AF_LINK family type, similar to what is
provided by the "route" command. In order to differentiate between
these two types of operations, a RTF_LLDATA flag is introduced. This
flag is set by the "arp" and "ndp" commands when issuing the add and
delete commands. This flag is also set in each L2 entry returned by the
kernel. The "arp" and "ndp" command follows a convention where a RTM_GET
is issued first followed by a RTM_ADD/DELETE. This RTM_GET request fills
in the fields for a "rtm" object, which is reinjected into the kernel by
a subsequent RTM_ADD/DELETE command. The entry returend from RTM_GET
is a prefix route, so the RTF_LLDATA flag must be specified when issuing
the RTM_ADD/DELETE messages.

2. Enforce the convention that NET_RT_FLAGS with a 0 w_arg is the
specification for retrieving L2 information. Also optimized the
code logic.

Reviewed by: julian


# 6e6b3f7c 14-Dec-2008 Qing Li <qingli@FreeBSD.org>

This main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,

The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
me maintaining that branch before the svn conversion


# a5a3926c 13-Dec-2008 Andrew Thompson <thompsa@FreeBSD.org>

Dont leak the rnh lock on error.


# 9b20205d 10-Dec-2008 Kip Macy <kmacy@FreeBSD.org>

fix a reported panic when adding a route and one hit here when deleting a route

- pass RTF_RNH_LOCKED to rtalloc1_fib in 2 cases where the lock is held
- make sure the rnh lock is held across rt_setgate and rt_getifa_fib


# 609ff41f 07-Dec-2008 Warner Losh <imp@FreeBSD.org>

Add missing include to sys/lock.h before sys/rwlock.h


# 3120b9d4 07-Dec-2008 Kip Macy <kmacy@FreeBSD.org>

- convert radix node head lock from mutex to rwlock
- make radix node head lock not recursive
- fix LOR in rtexpunge
- fix LOR in rtredirect

Reviewed by: sam


# 4b79449e 02-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 413628a7 29-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible


# 1ede983c 23-Oct-2008 Dag-Erling Smørgrav <des@FreeBSD.org>

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 8b615593 02-Oct-2008 Marko Zec <zec@FreeBSD.org>

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 603724d3 17-Aug-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# 4d896055 09-Jul-2008 Robert Watson <rwatson@FreeBSD.org>

Remove unused support for local and foreign addresses in generic raw
socket support. These utility routines are used only for routing and
pfkey sockets, neither of which have a notion of address, so were
required to mock up fake socket addresses to avoid connection
requirements for applications that did not specify their own fake
addresses (most of them).

Quite a bit of the removed code is #ifdef notdef, since raw sockets
don't support bind() or connect() in practice. Removing this
simplifies the raw socket implementation, and removes two (commented
out) uses of dtom(9).

Fake addresses passed to sendto(2) by applications are ignored for
compatibility reasons, but this is now done in a more consistent way
(and with a comment). Possibly, EINVAL could be returned here in
the future if it is determined that no applications depend on the
semantic inconsistency of specifying a destination address for a
protocol without address support, but this will require some amount
of careful surveying.

NB: This does not affect netinet, netinet6, or other wire protocol
raw sockets, which provide their own independent infrastructure with
control block address support specific to the protocol.

MFC after: 3 weeks
Reviewed by: bz


# 59dd72d0 03-Jul-2008 Robert Watson <rwatson@FreeBSD.org>

Remove NETISR_MPSAFE, which allows specific netisr handlers to be directly
dispatched without Giant, and add NETISR_FORCEQUEUE, which allows specific
netisr handlers to always be dispatched via a queue (deferred). Mark the
usb and if_ppp netisr handlers as NETISR_FORCEQUEUE, and explicitly
acquire Giant in those handlers.

Previously, any netisr handler not marked NETISR_MPSAFE would necessarily
run deferred and with Giant acquired. This change removes Giant
scaffolding from the netisr infrastructure, but NETISR_FORCEQUEUE allows
non-MPSAFE handlers to continue to force deferred dispatch so as to avoid
lock order reversals between their acqusition of Giant and any calling
context.

It is likely we will be able to remove NETISR_FORCEQUEUE once
IFF_NEEDSGIANT is removed, as non-MPSAFE usb and if_ppp drivers will no
longer be supported.

Reviewed by: bz
MFC after: 1 month
X-MFC note: We can't remove NETISR_MPSAFE from stable/7 for KPI reasons,
but the rest can go back.


# 8b07e49a 09-May-2008 Julian Elischer <julian@FreeBSD.org>

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# e440aed9 12-Apr-2008 Qing Li <qingli@FreeBSD.org>

This patch provides the back end support for equal-cost multi-path
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
is disallowed. For example,

route add -net 192.103.54.0/24 10.9.44.1
route add -net 192.103.54.0/24 10.9.44.2

The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"

Multiple default routes can also be inserted. Here is the netstat
output:

default 10.2.5.1 UGS 0 3074 bge0 =>
default 10.2.5.2 UGS 0 0 bge0

When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,

route delete default

would fail and trigger the following error message:

"route: writing to routing socket: No such process"
"delete net default: not in table"

On the other hand,

route delete default 10.2.5.2

would be successful: "delete net default: gateway 10.2.5.2"

One does not have to specify a gateway if there is only a single
route for a particular destination.

I need to perform more testings on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options RADIX_MPATH" in the kernel configuration
to enable this feature.

Reviewed by: robert, sam, gnn, julian, kmacy


# 237fdd78 16-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 18b6e4c8 08-Sep-2007 Olivier Houchard <cognet@FreeBSD.org>

Do not set the RTF_GATEWAY flag if RTF_LLINFO is set, it doesn't make much
sense in that context, and leads to unusable routes.
This should unbreak bootpd.

Discussed with: glebius
Submitted by: bms
Approved by: re (bmah)


# 5de55821 27-Mar-2007 Gleb Smirnoff <glebius@FreeBSD.org>

Fix regression in rev. 1.140.

Reported by: Yuriy Tsibizov <Yuriy.Tsibizov gfk.ru>, bsam


# 75ae0c01 27-Mar-2007 Bruce M Simpson <bms@FreeBSD.org>

Fix a case where hardware removal of an interface caused an attempt to
announce an ll_ifma which has gone away. Add a KASSERT to catch regressions.

Bug found by: Tom Uffner


# 9406b274 22-Mar-2007 Gleb Smirnoff <glebius@FreeBSD.org>

When working on an RTM_CHANGE do the route editing in the following
sequence. First, if rt_ifa is going to be changed, then call
ifa_rtrequest(RTM_DELETE). Second, if gateway is going to be changed,
then call rt_setgate(). Third, change rt_ifa.

With this change we are able to change a link level route to a
gateway one, that wasn't possible before:

# ifconfig em0 192.168.22.1/24
# arp -s 192.168.22.99 00:11:22:33:44:55
# route change 192.168.22.99 192.168.22.199
# ping 192.168.22.99
db>

Reported by: avatar


# acd3428b 06-Nov-2006 Robert Watson <rwatson@FreeBSD.org>

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# f8829a4a 03-Nov-2006 Randall Stewart <rrs@FreeBSD.org>

Ok, here it is, we finally add SCTP to current. Note that this
work is not just mine, but it is also the works of Peter Lei
and Michael Tuexen. They both are my two key other developers
working on the project.. and they need ata-boy's too:
****
peterlei@cisco.com
tuexen@fh-muenster.de
****
I did do a make sysent which updated the
syscall's and sysproto.. I hope that is correct... without
it you don't build since we have new syscalls for SCTP :-0

So go out and look at the NOTES, add
option SCTP (make sure inet and inet6 are present too)
and play with SCTP.

I will see about comitting some test tools I have after I
figure out where I should place them. I also have a
lib (libsctp.a) that adds some of the missing socketapi
functions that I need to put into lib's.. I will talk
to George about this :-)

There may still be some 64 bit issues in here, none of
us have a 64 bit processor to test with yet.. Michael
may have a MAC but thats another beast too..

If you have a mac and want to use SCTP contact Michael
he maintains a web site with a loadable module with
this code :-)

Reviewed by: gnn
Approved by: gnn


# a152f8a3 21-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

Change semantics of socket close and detach. Add a new protocol switch
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket. pru_abort is now a
notification of close also, and no longer detaches. pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket. This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach of the protocol retained a reference.

This faciliates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree(). With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.

Reviewed by: gnn


# e27c3f48 05-Jul-2006 Oleg Bulyzhin <oleg@FreeBSD.org>

Adjust rt_(set|get)metrics() to do kernel <-> userland timebase conversion.
We need it since kernel timebase has changed (time_second -> time_uptime).

Approved by: glebius (mentor)


# bc725eaf 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Chance protocol switch method pru_detach() so that it returns void
rather than an error. Detaches do not "fail", they other occur or
the protocol flags SS_PROTOREF to take ownership of the socket.

soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF. so_pcb is now entirely owned and
managed by the protocol code. Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.

Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.

In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.

netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit. In their current state they may leak
memory or panic.

MFC after: 3 months


# ac45e92f 01-Apr-2006 Robert Watson <rwatson@FreeBSD.org>

Change protocol switch pru_abort() API so that it returns void rather
than an int, as an error here is not meaningful. Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.

This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they have not are not fully
updated by this commit. This will be corrected shortly in followup
commits to these components.

MFC after: 3 months


# 22cafcf0 15-Mar-2006 Andre Oppermann <andre@FreeBSD.org>

- Fill in the correct rtm_index for RTM_ADD and RTM_CHANGE messages.

- Allow RTM_CHANGE to change a number of route flags as specified by
RTF_FMASK.

- The unused rtm_use field in struct rt_msghdr is redesignated as
rtm_fmask field to communicate route flag changes in RTM_CHANGE
messages from userland. The use count of a route was moved to
rtm_rmx a long time ago. For source code compatibility reasons
a define of rtm_use to rtm_fmask is provided.

These changes faciliate running of multiple cooperating routing
daemons at the same time without causing undesired interference.
Open[BGP|OSPF]D make use of these features to have IGP routes
override EGP ones.

Obtained from: OpenBSD (claudio@)
MFC after: 3 days


# 4a0d6638 11-Nov-2005 Ruslan Ermilov <ru@FreeBSD.org>

- Store pointer to the link-level address right in "struct ifnet"
rather than in ifindex_table[]; all (except one) accesses are
through ifp anyway. IF_LLADDR() works faster, and all (except
one) ifaddr_byindex() users were converted to use ifp->if_addr.

- Stop storing a (pointer to) Ethernet address in "struct arpcom",
and drop the IFP2ENADDR() macro; all users have been converted
to use IF_LLADDR() instead.


# 303989a2 09-Nov-2005 Ruslan Ermilov <ru@FreeBSD.org>

Use sparse initializers for "struct domain" and "struct protosw",
so they are easier to follow for the human being.


# a11faa9f 19-Sep-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Drop current rtentry lock before calling rt_getifa(). This fixes a LOR
and a possible recursive use of rtentry mutex.

PR: kern/69356
Reviewed by: sam


# fe0fc7ef 10-Sep-2005 Christian S.J. Peron <csjp@FreeBSD.org>

Protect interface and address lists using the appropriate mutex. These
locks were not aquired because the user buffers were not wired, thus it was
possible that that SYSCTL_OUT could sleep, causing a number of different
problems such as lock ordering issues and dead locks.

-Wire user supplied buffer to ensure SYSCTL_OUT will not sleep.
-Pickup ifnet locks to protect the list.
-Where applicable pickup address locks.
-Pickup radix node head locks.
-Remove splnet stubs
-Remove various comments about locking here, because they are no
longer needed.

It is the hope that these changes will make sysctl_rtsock MP safe.

MFC after: 3 weeks


# 5b1c0294 07-Sep-2005 David E. O'Brien <obrien@FreeBSD.org>

Forward declaring static variables as extern is invalid ISO-C. Now that
GCC can properly handle forward static declarations, do this properly.


# 7e994955 25-Aug-2005 Robert Watson <rwatson@FreeBSD.org>

De-spl parts of the routing socket code now generally protected
through locking; leave some spl references around code where there
are open questions about global variable references. Also, add
an XXX regarding locking in sysctl.

MFC after: 3 days


# 79188861 11-Aug-2005 Gleb Smirnoff <glebius@FreeBSD.org>

o To prevent a race between RTM_DELETE message and
arptimer() deleting stale entry, we need to lock
rtentry before unlocking radix head.

Reviewed by: sam


# 292ee7be 09-Aug-2005 Robert Watson <rwatson@FreeBSD.org>

Rename IFF_RUNNING to IFF_DRV_RUNNING, IFF_OACTIVE to IFF_DRV_OACTIVE,
and move both flags from ifnet.if_flags to ifnet.if_drv_flags, making
and documenting the locking of these flags the responsibility of the
device driver, not the network stack. The flags for these two fields
will be mutually exclusive so that they can be exposed to user space as
though they were stored in the same variable.

Provide #defines to provide the old names #ifndef _KERNEL, so that user
applications (such as ifconfig) can use the old flag names. Using the
old names in a device driver will result in a compile error in order to
help device driver writers adopt the new model.

When exposing the interface flags to user space, via interface ioctls
or routing sockets, or the two fields together. Since the driver flags
cannot currently be set for user space, no new logic is currently
required to handle this case.

Add some assertions that general purpose network stack routines, such
as if_setflags(), are not improperly used on driver-owned flags.

With this change, a large number of very minor network stack races are
closed, subject to correct device driver locking. Most were likely
never triggered.

Driver sweep to follow; many thanks to pjd and bz for the line-by-line
review they gave this patch.

Reviewed by: pjd, bz
MFC after: 7 days


# ba7be0a9 15-Jul-2005 George V. Neville-Neil <gnn@FreeBSD.org>

Fix for PR 82974. We were not checking that the route looked up in
the case of an RTM_CHANGE was specific, i.e. that it matched completely. This
led to a route change of a non-existent route changing the default route
as the radix code would simply back track to that point and hand that
route back to the routing socket code.

PR: 82974
Reviewed by: Tai-hwa Liang <avatar@mmlab.cse.yzu.edu.tw>
Ben Kaduk <minimarmot@gmail.com>
Bjoern A. Zeeb <bzeeb-lists@lists.zabbadoz.net>
Obtained from: OpenBSD with modifications.
MFC after: 2 weeks


# 25029d6c 08-Jun-2005 Hartmut Brandt <harti@FreeBSD.org>

When returing an RTM_GET message through the routing socket fill
in the rtm_index field whenever we have an interface pointer. This
is consistent with the RTM_GET messages returned by sysctl().


# 7a7fa27b 26-Mar-2005 Sam Leffler <sam@FreeBSD.org>

rt_newaddrmsg will blow up if given something other than RTM_ADD
or RTM_DELETE; add an assertion, may want to do something more
heavyhanded in the future

Noticed by: Coverity Prevent analysis tool
Reviewed by: mdodd


# 8d78bea4 23-Feb-2005 Sam Leffler <sam@FreeBSD.org>

eliminate dead code and collapse the remainder

Noticed by: Coverity Prevent analysis tool
Reviewed by: rwatson


# c398230b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 756d52a1 08-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Initialize struct pr_userreqs in new/sparse style and fill in common
default elements in net_init_domain().

This makes it possible to grep these structures and see any bogosities.


# b83a279f 05-Oct-2004 Sam Leffler <sam@FreeBSD.org>

Add 802.11-specific events that are dispatched through the routing socket.
This really doesn't belong here but is preferred (for the moment) over
adding yet another mechanism for sending msgs from the kernel to user apps.

Reviewed by: imp


# 3161f583 27-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

Apply error and success logic consistently to the function netisr_queue() and
its users.

netisr_queue() now returns (0) on success and ERRNO on failure. At the
moment ENXIO (netisr queue not functional) and ENOBUFS (netisr queue full)
are supported.

Previously it would return (1) on success but the return value of IF_HANDOFF()
was interpreted wrongly and (0) was actually returned on success. Due to this
schednetisr() was never called to kick the scheduling of the isr. However this
was masked by other normal packets coming through netisr_dispatch() causing the
dequeueing of waiting packets.

PR: kern/70988
Found by: MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after: 3 days


# 18aee723 24-Aug-2004 Peter Pentchev <roam@FreeBSD.org>

Fix a typo (attacked -> attached).

Approved by: sam


# b062951a 21-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

If a tunable for the routing socket netisr queue max is defined, allow it
to override the default value, rather than the default value overriding
the tunable.


# 190a4c94 21-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Allow the size of the routing socket netisr queue to be configured using
the tunable or sysctl 'net.route.netisr_maxqlen'. Default the maximum
depth to 256 rather than IFQ_MAXLEN due to the downsides of dropping
routing messages.

MT5 candidate.

Discussed with: mdodd, mlaier, Vincent Jardin <jardin at 6wind.com>


# 3b7d076f 13-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Use IFQ_SET_MAXLEN() to set the maximum queue depth of the routing
socket netisr queue.

Pointed out by: winter


# 9b3d77e7 05-Jul-2004 Bruce M Simpson <bms@FreeBSD.org>

Be consistent and use bzero() instead of memset().


# d989c7b3 08-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Introduce a netisr to deliver kernel-generated routing, avoiding
recursive entering of the socket code from the routing code:

- Modify rt_dispatch() to bundle up the sockaddr family, if any,
associated with a pending mbuf to dispatch to routing sockets, in
an m_tag on the mbuf.

- Allocate NETISR_ROUTE for use by routing sockets.

- Introduce rtsintrq, an ifqueue to be used by the netisr, and
introduce rts_input(), a function to unbundle the tagged sockaddr
and inject the mbuf and address into raw_input(), which previously
occurred in rt_dispatch().

- Introduce rts_init() to initialize rtsintrq, its mutex, and
register the netisr. Perform this at the same point in system
initialization as setup of the domains.

This change introduces asynchrony between the generation of a
pending routing socket message and delivery to sockets for use
by userspace. It avoids socket->routing->rtsock->socket use and
helps to avoid lock order reversals between the routing code and
socket code (in particular, raw socket control blocks), as route
locks are held over calls to rt_dispatch().

Reviewed by: "George V.Neville-Neil" <gnn@neville-neil.com>
Conceptual head nod by: sam


# 3581cc66 10-May-2004 Christian S.J. Peron <csjp@FreeBSD.org>

Zero the un-used portions of the struct sockaddr data before sending
it back to userspace, so it does not break bind(2) on raw sockets in jails.

Currently some processes, like traceroute(8) construct a routing request
to determine its source address based on the destination. This sockaddr
data is fed directly to bind(2). When bind calls ifa_ifwithaddr(9) to
make sure the address exists on the interface, the comparison will
fail causing bind(2) to return EADDRNOTAVAIL if the data wasnt zero'ed
before initialization.

Approved by: bmilekic (mentor)


# 1a0c4873 03-May-2004 Maxim Konovalov <maxim@FreeBSD.org>

o Fix misindentation in the previous commit.


# 5a59cefc 26-Apr-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Give jail(8) the feature to allow raw sockets from within a
jail, which is less restrictive but allows for more flexible
jail usage (for those who are willing to make the sacrifice).
The default is off, but allowing raw sockets within jails can
now be accomplished by tuning security.jail.allow_raw_sockets
to 1.

Turning this on will allow you to use things like ping(8)
or traceroute(8) from within a jail.

The patch being committed is not identical to the patch
in the PR. The committed version is more friendly to
APIs which pjd is working on, so it should integrate
into his work quite nicely. This change has also been
presented and addressed on the freebsd-hackers mailing
list.

Submitted by: Christian S.J. Peron <maneo@bsdpro.com>
PR: kern/65800


# 9554c70b 19-Apr-2004 Ruslan Ermilov <ru@FreeBSD.org>

More style and deobfuscation fixes.

Submitted by: bde


# ae24a36e 18-Apr-2004 Ruslan Ermilov <ru@FreeBSD.org>

Style and code unobfuscation.


# b088717c 18-Apr-2004 Ruslan Ermilov <ru@FreeBSD.org>

Fixed a bug from rev. 1.42: cast to a correct type.

Submitted by: luigi


# 6b96f1af 18-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

+ replace Bcmp/Bzero with 'the real thing' as in the rest of the file.
+ remember to check and fix or explain a strange cast in route_output()


# 5dfc91d7 17-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

Minor changes to improve code readability (no actual code changes):
+ replace 0 with NULL where appropriate (not complete)
+ remove register declaration while there
+ add argument names to function prototypes to have a better idea of
what they are used for
+ add 'const' qualifiers in 3 places


# 913af518 17-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

misc cleanup in sysctl_ifmalist():
+ remove a partly incorrect comment that i introduced in the last commit;
+ deal with the correct part of the above comment by cleaning up the
updates of 'info' -- rti_addrs needd not to be updated,
rti_info[RTAX_IFP] can be set once outside the loop.
While at it, correct a few misspelling of NULL as 0, but there are
way too many in this file, and i did not want to clutter the
important part of this commit.


# 9b98ee2c 16-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

Consistently use ifaddr_byindex() to access the link-level address
of an interface. No functional change.

On passing, comment a likely bug in net/rtsock.c:sysctl_ifmalist()
which, if confirmed, would deserve to be fixed and MFC'ed


# e74642df 13-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

route.h: introduce a macro, SA_SIZE(struct sockaddr *) which returns
the space occupied by a struct sockaddr when passed through a
routing socket.
Use it to replace the macro ROUNDUP(int), that does the same but
is redefined by every file which uses it, courtesy of
the School of Cut'n'Paste Programming(TM).

(partial) userland changes to follow.


# a8b76c8f 12-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

remove an almost-duplicate piece of code by setting the loop
limits appropriately.


# f36cfd49 07-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 97d8d152 20-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# a557af22 17-Nov-2003 Robert Watson <rwatson@FreeBSD.org>

Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.

Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 05b2efe0 14-Nov-2003 Bruce M Simpson <bms@FreeBSD.org>

Add a sysctl MIB, NET_RT_IFMALIST, to retrieve multicast group memberships
in a protocol-independent way.

Submitted by: harti


# 7138d65c 08-Nov-2003 Sam Leffler <sam@FreeBSD.org>

replace explicit changes to rt_refcnt by RT_ADDREF and RT_REMREF
macros that expand to include assertions when the system is built
with INVARIANTS

Supported by: FreeBSD Foundation


# 9bf40ede 31-Oct-2003 Brooks Davis <brooks@FreeBSD.org>

Replace the if_name and if_unit members of struct ifnet with new members
if_xname, if_dname, and if_dunit. if_xname is the name of the interface
and if_dname/unit are the driver name and instance.

This change paves the way for interface renaming and enhanced pseudo
device creation and configuration symantics.

Approved By: re (in principle)
Reviewed By: njl, imp
Tested On: i386, amd64, sparc64
Obtained From: NetBSD (if_xname)


# d1dd20be 03-Oct-2003 Sam Leffler <sam@FreeBSD.org>

Locking for updates to routing table entries. Each rtentry gets a mutex
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.

Other/related changes:

o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts

Notes:

1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do-no know about mutex's. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.

Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)


# aea8b30f 03-Oct-2003 Sam Leffler <sam@FreeBSD.org>

trivial locking rtsock_cb

Sponsored by: FreeBSD Foundation


# becc44d7 03-Oct-2003 Sam Leffler <sam@FreeBSD.org>

cleanups prior to adding locking (and in some cases to eliminate locking):

o move route_cb to be private to rtsock.c
o replace global static route_proto by locals
o eliminate global #define shorthands for info references
o remove some register decls
o ansi-fy function decls
o move items to be close in scope to their usage
o add rt_dispatch function for dispatching the actual message
o cleanup tangled logic for doing all-but-me msg send

Support by: FreeBSD Foundation


# 3c6b084e 05-Mar-2003 Peter Wemm <peter@FreeBSD.org>

Finish driving a stake through the heart of netns and the associated
ifdefs scattered around the place - its dead Jim!

The SMB stuff had stolen AF_NS, make it official.


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 7701e15b 25-Dec-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Disable radix node locking for sysctl until we fix the sysctl infrastructure
to not sleep.


# b2aaf46e 25-Dec-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Range-check the address family parameter passed in to the sysctl handler.

Submitted by: ru


# 71eba915 25-Dec-2002 Ruslan Ermilov <ru@FreeBSD.org>

If the caller of rtrequest*(RTM_DELETE, ...) asked for a copy of
the entry being removed (ret_nrt != NULL), increment the entry's
rt_refcnt like we do it for RTM_ADD and RTM_RESOLVE, rather than
messing around with 1->0 transitions for rtfree() all over.


# 956b0b65 23-Dec-2002 Jeffrey Hsu <hsu@FreeBSD.org>

SMP locking for radix nodes.


# b30a244c 21-Dec-2002 Jeffrey Hsu <hsu@FreeBSD.org>

SMP locking for ifnet list.


# 12e552d6 20-Dec-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Swap the order of a free and a use of an ifaddr structure.


# 19fc74fb 18-Dec-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Lock up ifaddr reference counts.


# 8d3574c7 01-Oct-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Fix some harmless mis-indents.

Spotted by: FlexeLint


# 37c84183 28-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# 93b0017f 25-Aug-2002 Philippe Charnier <charnier@FreeBSD.org>

Replace various spelling with FALLTHROUGH which is lint()able


# 62f76486 18-Aug-2002 Maxim Sobolev <sobomax@FreeBSD.org>

Increase size of ifnet.if_flags from 16 bits (short) to 32 bits (int). To avoid
breaking application ABI use unused ifreq.ifru_flags[1] for upper 16 bits in
SIOCSIFFLAGS and SIOCGIFFLAGS ioctl's.

Reviewed by: -hackers, -net


# 03e49181 18-Jun-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Remove so*_locked(), which were backed out by mistake.


# 4cc20ab1 31-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 243917fe 19-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# f1320723 01-May-2002 Alfred Perlstein <alfred@FreeBSD.org>

Redo the sigio locking.

Turn the sigio sx into a mutex.

Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.

In order to do this in the most unintrusive manner change pgsigio's
sigio * argument into a **, that way we can lock internally to the
function.


# 960ed29c 29-Apr-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Revert the change of #includes in sys/filedesc.h and sys/socketvar.h.

Requested by: bde

Since locking sigio_lock is usually followed by calling pgsigio(),
move the declaration of sigio_lock and the definitions of SIGIO_*() to
sys/signalvar.h.

While I am here, sort include files alphabetically, where possible.


# d48d4b25 27-Apr-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Add a global sx sigio_lock to protect the pointer to the sigio object
of a socket. This avoids lock order reversal caused by locking a
process in pgsigio().

sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now
require sigio_lock to be locked. Provide sowwakeup_locked(),
soisconnected_locked(), and so on in case where we have to modify a
socket and wake up a process atomically.


# 44731cab 01-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 929ddbbb 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 694ff264 27-Jan-2002 Andrew Gallatin <gallatin@FreeBSD.org>

Prevent the kernel from generating an unaligned sysctl data buffer on
64-bit platforms. The unaligned access is caused by struct ifa_msghdr
not being a multiple of 8-bytes in size. If an interface has an odd
number of addresses, this causes the next interface to generate an
unaligned access in the user-level app walking the interfaces (ifconfig).

Submitted by: Bernd Walter <ticso@cicely8.cicely.de>


# f7a54d06 24-Jan-2002 Crist J. Clark <cjc@FreeBSD.org>

Have sysctl() return the correct errno(2) as documented in the
sysctl(3) manpage.

Submitted by: ru
Obtained from: BSD/OS


# 7b6edd04 18-Jan-2002 Ruslan Ermilov <ru@FreeBSD.org>

Introduce an interface announcement message for the routing
socket so that routing daemons and other interested parties
know when an interface is attached/detached.

PR: kern/33747
Obtained from: NetBSD
MFC after: 2 weeks


# e20e9426 19-Dec-2001 Brian Somers <brian@FreeBSD.org>

It's no longer necessary to ensure that ``gate'' is set when RTF_GATEWAY
is passed, as subsequent code does that check now anyway.

Submitted by: ru


# 02a5d63e 19-Dec-2001 Brian Somers <brian@FreeBSD.org>

Only call rt_getifa() if we've either been passed a gateway or
if we've been given an RTA_IFP or changed RTA_IFA sockaddr.

This fixes the following bug:
>/dev/tun100
>/dev/tun101
ifconfig tun100 1.2.3.4 5.6.7.8
ifconfig tun101 1.2.3.4 6.7.8.9
route change 6.7.8.9 -ifa 1.2.3.4 -iface -mtu 500
which erroneously changed tun101's host route to have an ifp of tun100
(rt_getifa() sets the ifp after calling ifa_ifwithnet(1.2.3.4))

This incarnation submitted by: ru


# 8071913d 17-Oct-2001 Ruslan Ermilov <ru@FreeBSD.org>

Pull post-4.4BSD change to sys/net/route.c from BSD/OS 4.2.

Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *''
as the argument. Pass rt_addrinfo all the way down to rtrequest1
and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now
``rt_addrinfo *'' instead of ``sockaddr *'' (almost noone is
using it anyways).

Benefit: the following command now works. Previously we needed
two route(8) invocations, "add" then "change".
# route add -inet6 default ::1 -ifp gif0

Remove unsafe typecast in rtrequest(), from ``rtentry *'' to
``sockaddr *''. It was introduced by 4.3BSD-Reno and never
corrected.

Obtained from: BSD/OS, NetBSD
MFC after: 1 month
PR: kern/28360


# 28070a0e 17-Oct-2001 Ruslan Ermilov <ru@FreeBSD.org>

Bring in latest CSRG revisions to this file:

- Report destination address of a P2P link when servicing
routing socket messages.

- Report interface name, address, and destination address
of a P2P link when servicing NET_RT_{DUMP,FLAGS} sysctls.

Part of CSRG revision 8.6 coresponds to revision 1.12.
CSRG revision 8.7 corresponds to revision 1.15.


# a35b06c5 28-Sep-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Change sysctl_iflist() so it has a single point of return. This will
assist any future locking efforts.


# dadb6c3b 20-Sep-2001 Ruslan Ermilov <ru@FreeBSD.org>

Use the current process's credentials rather than socket's cached.
If the process drops its super-user privileges, we certainly don't
want to allow it to modify routing tables.

Discussed with: rwatson


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 162c0b2e 30-Aug-2001 Ruslan Ermilov <ru@FreeBSD.org>

Synch with NetBSD and OpenBSD.

Allow non-superuser to open, listen to, and send safe commands on the
routing socket. Superuser priviledge is required for all commands
but RTM_GET.

Lose `setuid root' bit of route(8).

Reviewed by: wollman, dd


# 7ba271ae 02-Aug-2001 Jonathan Chen <jon@FreeBSD.org>

fix memory leak when error during opening of routing socket

PR: kern/29336
Submitted by: Richard Andrades <richard@xebeo.com>
MFC after: 1 month


# 03311056 04-Jul-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

adjust mbuf length right in route_output().

Obtained from: KAME
MFC after: 1 week


# 33841545 10-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# 91421ba2 20-Feb-2001 Robert Watson <rwatson@FreeBSD.org>

o Move per-process jail pointer (p->pr_prison) to inside of the subject
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.

Notes:

o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.

Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project


# fc2ffbe6 04-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# 22f29826 03-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Use <sys/queue.h> macro api rather than fondle its implementation detals.

Created with: /usr/bin/sed
Reviewed by: /sbin/md5


# 7cc0979f 08-Dec-2000 David Malone <dwmalone@FreeBSD.org>

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 8bf72aef 25-Jul-2000 Hajimu UMEMOTO <ume@FreeBSD.org>

Workaround to avoid panic during detach pccard nic.


# 77978ab8 04-Jul-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.

Pointed out by: bde


# 82d9ae4e 03-Jul-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:

Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our
sources:

-sysctl_vm_zone SYSCTL_HANDLER_ARGS
+sysctl_vm_zone (SYSCTL_HANDLER_ARGS)


# 242c5536 12-Feb-2000 Peter Wemm <peter@FreeBSD.org>

Clean up some loose ends in the network code, including the X.25 and ISO
#ifdefs. Clean out unused netisr's and leftover netisr linker set gunk.
Tested on x86 and alpha, including world.

Approved by: jkh


# 899ce4f4 28-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Count AF_INET6 attachement to routing socket.

Obtained from: KAME project


# 920eb79f 28-Dec-1999 Ruslan Ermilov <ru@FreeBSD.org>

Make cloning mask sockaddr (genmask) possible.

PR: kern/3061
Reviewed by: wollman


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# cb64988f 28-Apr-1999 Luoqi Chen <luoqi@FreeBSD.org>

Postpone route_init() until all domains are attached.


# 75c13541 28-Apr-1999 Poul-Henning Kamp <phk@FreeBSD.org>

This Implements the mumbled about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more useable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no privisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/


# 831a80b0 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 95b6073c 31-Oct-1997 David Greenman <dg@FreeBSD.org>

Fixed bug in RTM_ADD where rmx_locks weren't being set on the new route,
preventing "route add default 1.2.3.4 -lock -mtu 1500" from working as
expected (which is, BTW, to disable Path MTU Discovery).


# 55b211e3 28-Oct-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# a1c995b6 12-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# f8f6cbba 13-Sep-1997 Peter Wemm <peter@FreeBSD.org>

Update network code to use poll support.


# 4d1d4912 01-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Added used #include - don't depend on <sys/mbuf.h> including
<sys/malloc.h> (unless we only use the bogusly shared M*WAIT flags).


# 57bf258e 16-Aug-1997 Garrett Wollman <wollman@FreeBSD.org>

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


# 076d0761 18-Jul-1997 Julian Elischer <julian@FreeBSD.org>

An actual fix for the routing default crashes that
1/ is compatible with the old route(1) in case needed.
2/ actually fixes the problem while vetting bad user input.
note: I have already fixed route(1) so the problem shouldn't occur.
if it does. use 0.0.0.0/0 instead of the word 'default' :)


# 9a969a6e 17-Jul-1997 Mike Smith <msmith@FreeBSD.org>

Fix Julian's fixed fix. Routing is weird.

We need to accept at least one sockaddr with zero length, in order
to be able to set the default route.

Suggested by: Phone conversation with Julian (sleep well!)


# ff6d0a59 16-Jul-1997 Julian Elischer <julian@FreeBSD.org>

Bungled cut/paste leaves kernel with page faults..
(read all about it!)


# 7f33a738 15-Jul-1997 Julian Elischer <julian@FreeBSD.org>

Finally track down the reason for some of my occasional kernel crashes.
Route(1) has a bug that sends a bad message to the kernel. The kernel
trusts it and crashes. Add some sanity checks so that
we don't trust the user quite as much any more.
(also add a comment in if_ethersubr.c)


# a29f300e 27-Apr-1997 Garrett Wollman <wollman@FreeBSD.org>

The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 3aa72005 03-Feb-1997 Bill Fenner <fenner@FreeBSD.org>

Make sure we have arguments to pass before calling ifaof_ifpforaddr
and ifa_ifwithroute.

This eliminates the panic seen in kern/2647, although it doesn't
address the fact that RTM_CHANGE can't change flags.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 477180fb 13-Jan-1997 Garrett Wollman <wollman@FreeBSD.org>

Use the new if_multiaddrs list for multicast addresses rather than the
previous hackery involving struct in_ifaddr and arpcom. Get rid of the
abominable multi_kludge. Update all network interfaces to use the
new machanism. Distressingly few Ethernet drivers program the multicast
filter properly (assuming the hardware has one, which it usually does).


# 59562606 13-Dec-1996 Garrett Wollman <wollman@FreeBSD.org>

Convert the interface address and IP interface address structures
to TAILQs. Fix places which referenced these for no good reason
that I can see (the references remain, but were fixed to compile
again; they are still questionable).


# 29412182 11-Dec-1996 Garrett Wollman <wollman@FreeBSD.org>

Use queue macros for the list of interfaces. Next stop: ifaddrs!


# 1db1fffa 09-Jul-1996 Bill Fenner <fenner@FreeBSD.org>

Disallow host routes that point to themselves. These routes serve no
purpose, other than to get in the way of the ARP table and cause
"can't allocate llinfo" errors.

This change may cause gated or routed to start complaining when adding
such routes. If so, these programs will need to be fixed to not try
to add these routes.

Reviewed by: wollman


# 6ddbf1e2 07-May-1996 Gary Palmer <gpalmer@FreeBSD.org>

Clean up various compiler warnings. Most (if not all) were benign

Reviewed by: bde


# 2ee45d7d 11-Mar-1996 David Greenman <dg@FreeBSD.org>

Move or add #include <queue.h> in preparation for upcoming struct socket
changes.


# 52041295 16-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

All net.* sysctl converted now.


# cc6a66f2 26-Oct-1995 Julian Elischer <julian@FreeBSD.org>

Reviewed by: julian and jhay@mikom.csir.co.za
Submitted by: Mike Mitchell, supervisor@alb.asctmd.com

This is a bulk mport of Mike's IPX/SPX protocol stacks and all the
related gunf that goes with it..
it is not guaranteed to work 100% correctly at this time
but as we had several people trying to work on it
I figured it would be better to get it checked in so
they could all get teh same thing to work on..

Mikes been using it for a year or so
but on 2.0

more changes and stuff will be merged in from other developers now that this is in.

Mike Mitchell, Network Engineer
AMTECH Systems Corporation, Technology and Manufacturing
8600 Jefferson Street, Albuquerque, New Mexico 87113 (505) 856-8000
supervisor@alb.asctmd.com


# ed07cc7f 13-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

Protect against routing socket messages with way-too-big address families.

Submitted by: Keith Sklower by way of Paul Traina


# 4ab32c34 08-Oct-1995 Bruce Evans <bde@FreeBSD.org>

Fix types of sysctl functions. Add prototypes. Cosmetic.


# 9e52b982 03-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

Import of 4.4-Lite-2 sys/net to make merge and examination easier. Since we
are not on the vendor branch for any of these files, the conflicts shown make
no matter.

Obtained from: 4.4BSD-Lite-2


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# c9a84156 15-May-1995 David Greenman <dg@FreeBSD.org>

Fixed route reference count bug that squirmed in during the the
routing-socket code upgrade from Berkeley..

Submitted by: Garrett Wollman via Peter Wemm via Cornell


# 748e0b0a 10-May-1995 Garrett Wollman <wollman@FreeBSD.org>

Make networking domains drop-ins, through the magic of GNU ld. (Some day,
there may even be LKMs.) Also, change the internal name of `unixdomain'
to `localdomain' since AF_LOCAL is now the preferred name of this family.
Declare netisr correctly and in the right place.


# 78a82810 10-May-1995 Garrett Wollman <wollman@FreeBSD.org>

Updated routing-socket code from Berkeley

Obtained from: Keith Bostic by way of Paul Traina


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 995add1a 13-Dec-1994 Garrett Wollman <wollman@FreeBSD.org>

Add support for two separate cloning flags, one set by the lower layers,
and one set by the protocol family. Also add another parameter to
rtalloc1() to allow for any interface flags to be ignored; currently
this is only useful for RTF_PRCLONING. Get rid of rt_prflags and re-unite
with rt_flags. Add T/TCP ``route metrics''.

NB: YOU MUST RECOMPILE `route' AND OTHER RELATED PROGRAMS AS A RESULT OF
THIS CHANGE.

This also adds a new interface parameter, `ifi_physical', which will
eventually replace IFF_ALTPHYS as the mechanism for specifying the
particular physical connection desired on a multiple-connection card.

NB: YOU MUST RECOMPILE `ifconfig' AND OTHER RELATED PROGRAMS AS A RESULT OF
THIS CHANGE.


# 5df72964 11-Oct-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix a bug which caused panics when attempting to change just the flags of
a route. (This still doesn't work, but it doesn't panic now.) It looks
like there may be a number of incipient bugs in this code.

Also, get ready for the time when all IP gateway routes are cloning, which
is necessary to keep proper TCP statistics.


# df440948 08-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics: to silence gcc -wall.


# 2624cf89 05-Oct-1994 Garrett Wollman <wollman@FreeBSD.org>

A number of bug-fixes inspired by Mark Treacy:
- Allow PPP to run multicasts natively.
- Deal properly with lots of similarly-named interfaces.
- Don't sign-extend if_flags.

NB: the last fix (to rtsock.c) must be reversed when we expand if_flags to a
reasonable size.

Submitted by: Mark Treacy


# c5789ba3 04-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Moved m_copyback into uipc_mbuf.c


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources