History log of /freebsd-current/sys/net/if_var.h
Revision Date Author Comments
# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 2a371643 21-Jul-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Retire if_etherbpfmtap() and if_bpfmtap()

Summary:
These came in the original DrvAPI commits in 2014, and are obsoleted by
bpf_mtap_if() and ether_bpf_mtap_if(). The `_if` suffix, rather than
prefix, conveys that it's operating on the bpf of the interface, instead
than the interface itself.

Reviewed by: glebius
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D41146


# 2ff63af9 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .h pattern

Remove /^\s*\*+\s*\$FreeBSD\$.*$\n/


# 79379355 16-Jun-2023 Alexander V. Chernikov <melifaro@FreeBSD.org>

netlink: convert to IfAPI.

Convert to IfAPI everything except `IF_AFDATA_WLOCK` usage in neigh.c.

Reviewed By: jhibbits
Differential Revision: https://reviews.freebsd.org/D40577


# 848d5bb1 16-May-2023 Konstantin Belousov <kib@FreeBSD.org>

net/if_var.h: consistently use if_t over struct ifnet *

Reviewed by: jhibbits
Sponsored by: NVidia networking
Differential revision: https://reviews.freebsd.org/D40125


# f766d1d5 10-Apr-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Add if_maddr_empty() to check for any maddrs

if_llmaddr_count() only counts link-level multicast addresses.
hv_netvsc(4) needs to know if there are any multicast addresses. Since
hv_netvsc(4) is the only instance where this would be used, make it a
simple boolean. If others need a if_maddr_count(), that can be added in
the future.

Reviewed by: melifaro
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D39493


# 7814374b 07-Apr-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Hide the macros that touch ifnet members

Nothing should be directly touching the ifnet members, which are hidden
in <net/if_private.h>, so hide them in the same header to avoid errors
from users.

Sponsored by: Juniper Networks, Inc.


# 56d4550c 19-Apr-2023 Alexander V. Chernikov <melifaro@FreeBSD.org>

ifnet: factor out interface renaming into a separate function.

This change is required to support interface renaming via Netlink.
No functional changes intended.

Reviewed by: zlei
Differential Revision: https://reviews.freebsd.org/D39692
MFC after: 2 weeks


# e2427c69 16-Mar-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Add iterator to complement if_foreach()

Summary:
Sometimes an if_foreach() callback can be trivial, or need a lot of
outer context. In this case a regular `for` loop makes more sense. To
keep things hidden in the new API, use an opaque `if_iter` structure
that can still be instantiated on the stack. The current implementation
uses just a single pointer out of the 4 alotted to the opaque context,
and the cleanup does nothing, but may be used in the future.

Reviewed by: melifaro
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D39138


# df2b419a 04-Mar-2023 Alexander V. Chernikov <melifaro@FreeBSD.org>

ifnet: add if_foreach_sleep() to allow ifnet iterations with sleep.

Subscribers: imp, ae, glebius

Differential Revision: https://reviews.freebsd.org/D38904


# 66bdbcd5 03-Mar-2023 Alexander V. Chernikov <melifaro@FreeBSD.org>

net: unify mtu update code

Subscribers: imp, ae, glebius

Differential Revision: https://reviews.freebsd.org/D38893


# aac2d19d 10-Feb-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Style cleanup

Summary:
Clean up style issues from IfAPI additions.

Casts to (struct ifnet *) made sense when `if_t` was a `void *`, but
since it's a `struct ifnet *` it no longer makes sense. Fix whitespace
errors, among others.

Reviewed by: kib, glebius
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38499


# a3a76c3d 10-Feb-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Add capabilities2/capenable2 accessors

Summary:
As a stopgap measure add basic accessors for the if_capabilities2 and
if_capenable2 members to further hide the ifnet details.

Sponsored by: Juniper Networks, Inc.
Reviewed by: glebius, kib
Differential Revision: https://reviews.freebsd.org/D38487


# 189c3729 10-Feb-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: More accessors

Summary:
Add the following accessors needed by infiniband drivers:
* if_getaddrlen()
* if_setbroadcastaddr()
* if_resolvemulti()

With these accessors, and additional changes on the drivers' side, an
amd64 kernel can be compiled with `struct ifnet` completely hidden.

Reviewed by: melifaro
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38488


# 1e6131ba 01-Feb-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Add needed APIs for mbuf support

Summary:
Add 2 new APIs for supporting recent mbuf changes:
* 36e0a362ac added the m_snd_tag_alloc() wrapper around
if_snd_tag_alloc(). Push this down to the ifnet level.
* 4d7a1361ef adds the m_rcvif_serialize()/m_rcvif_restore() KPIs to
serialize and restore an ifnet pointer. Add the necessary wrapper to
get the index generation for this.

Reviewed By: jhb
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38340


# 2eeb8083 01-Feb-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Add iterator to loop over all interfaces

Summary:
Sometimes it's useful to iterate over all interfaces in the current
VNET, as the linuxulator does in several places.

Unlike other iterators in the IfAPI this propagates any error received
up to the caller, instead of returning a count.

Sponsored by: Juniper Networks, Inc.
Reviewed by: glebius, melifaro
Differential Revision: https://reviews.freebsd.org/D38348


# d79539e6 25-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Add if_altq_is_enabled() interface.

Summary:
The only user of the ALTQ_IS_ENABLED() in a driver checks against the
ifnet queue. Abstract that all out and present the interface to check
if ALTQ is enabled on the interface.

Sponsored by: Juniper Networks, Inc.
Reviewed By: glebius
Differential Revision: https://reviews.freebsd.org/D38204


# 31cfaf19 25-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Add l2com accessor for firewire.

Summary:
Firewire is the only device driver that accesses the l2com member, all
other accesses are handled within the netstack itself.

Sponsored by: Juniper Networks, Inc.
Reviewed by: glebius, melifaro
Differential Revision: https://reviews.freebsd.org/D38203


# 0d2684e1 24-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Add some more accessors

Summary:
* if_setreassignfn for wireguard.
* if_getinputfn() and if_getstartfn() for various drivers. Use the
function descriptor typedefs for these and the setters.
* vlantrunk accessor. This is used by VLAN_CAPABILITIES() used by
several drivers, as well as directly by mxge(4).
* if_pcp member accessor, used by cxgbe.
* accessors for netmap adapter.

Sponsored by: Juniper Networks, Inc.
Reviewed By: glebius
Differential Revision: https://reviews.freebsd.org/D38202


# c255d1a4 23-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Add if_llsoftc member accessors for TOEDEV

Summary:
Keep TOEDEV() macro for backwards compatibility, and add a SETTOEDEV()
macro to complement with the new accessors.

Sponsored by: Juniper Networks, Inc.
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D38199


# 30af2c13 23-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Add if_get/setmaclabel() and use it.

Summary:
Port the MAC modules to use the IfAPI APIs as part of this.

Sponsored by: Juniper Networks, Inc.
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D38197


# 113af4fd 24-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Add if_gettype() API and use it for vlan

Sponsored by: Juniper Networks, Inc.
Reviewed by: #network, glebius
Differential Revision: https://reviews.freebsd.org/D38198


# 053a24d1 13-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

debugnet: Add ifnet accessor to set debugnet methods

As part of the effort to hide the internals of the ifnet struct, convert
the DEBUGNET_SET() macro to use an accessor instead of directly touching
the methods member.

Reviewed by: glebius (older version)
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38105


# 2c2b37ad 13-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

ifnet/API: Move struct ifnet definition to a <net/if_private.h>

Hide the ifnet structure definition, no user serviceable parts inside,
it's a netstack implementation detail. Include it temporarily in
<net/if_var.h> until all drivers are updated to use the accessors
exclusively.

Reviewed by: glebius
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38046


# fa25dbfd 12-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

ifnet API: Change if_init() to take context argument

Some drivers, like iflib drivers, take a 'context' argument instead of a
ifnet argument, as a single interface may have multiple contexts.
Follow this scheme by passing the context argument down. Most drivers
will likely pass 'ifp' as the context.

Reviewed by: glebius
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38102


# ae330108 12-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

Revert "ifnet/API: Move the IfAPI from if_var.h to if.h"

<net/if.h> should be a fully user-facing header, so these APIs don't
belong there. Revert and will find another approach.

This reverts commit fe33e0ab83d1fbc3c5cd4a2591ba0036e47b1fec.

Fixes: fe33e0ab83d1
Sponsored by: Juniper Networks, Inc.


# fe33e0ab 11-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

ifnet/API: Move the IfAPI from if_var.h to if.h

Summary:
The "public" KPI for ifnet belongs in net/if.h, with net/if_var.h being
implementation details for the netstack. This is the next step in
enforcing that separation.

Reviewed by: melifaro
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38030


# be4315dc 21-Dec-2022 Justin Hibbits <jhibbits@FreeBSD.org>

ifnet/DrvAPI: Move if_t typedef to a better place

Summary:
<net/if_var.h> should really be used by the netstack only, not by
drivers. Eventually all the accessors will be moved to <net/if.h> as
well, but for now just move the typedef while the KPI gets sorted and
drivers get converted.

Sponsored by: Juniper Networks, Inc.
Reviewed By: melifaro, glebius
Differential Revision: https://reviews.freebsd.org/D37784


# eb1da3e5 09-Dec-2022 Justin Hibbits <jhibbits@FreeBSD.org>

DrvAPI: Extend driver KPI with more accessors

Summary:
Add the following accessors to hide some more netstack details:
* if_get/setcapabilities2 and *bits analogue
* if_setdname
* if_getxname
* if_transmit - wrapper for call to ifp->if_transmit()
- This required changing the existing if_transmit to
if_transmit_default, since that's its purpose.
* if_getalloctype
* if_getindex
* if_foreach_addr_type - Like if_foreach_lladdr() but for any address
family type. Used by some drivers to iterate over all AF_INET
addresses.
* if_init() - wrapper for ifp->if_init() call
* if_setinputfn
* if_setsndtagallocfn
* if_togglehwassist

Reviewers: #transport, #network, glebius, melifaro

Reviewed by: #network, melifaro
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D37664


# 984b27d8 30-Nov-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

net: add if_allocdescr() to permit updating iface description from the kernel

Reviewed by: kp,zlei
Differential Revision: https://reviews.freebsd.org/D37566
MFC after: 2 weeks


# 9a7c520a 24-Sep-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

ifp: add if_setdescr() / if_freedesrt() methods

Add methods for setting and removing the description from the interface,
so the external users can manage it without using ioctl API.

MFC after: 2 weeks


# b96549f0 11-Apr-2022 Konstantin Belousov <kib@FreeBSD.org>

struct ifnet: add if_capabilities2 and if_capenable2 bitmasks

We are running out of bits in if_capabilities.

Suggested by: jhb
Reviewed by: hselasky, jhb, kp (previous version)
Sponsored by: NVIDIA Networking
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D32551


# f2ab9160 19-May-2022 Andrey V. Elsukov <ae@FreeBSD.org>

[vlan + lagg] add IFNET_EVENT_UPDATE_BAUDRATE event

use it to update if_baudrate for vlan interfaces created on the LACP lagg.

Differential revision: https://reviews.freebsd.org/D33405


# 4d7a1361 26-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif

Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.

Reviewed by: kp
Differential revision: https://reviews.freebsd.org/D33266
Fun note: git show e6abef09187a

(cherry picked from commit e1882428dcbbafd2814d7e17b977a8f686784b39)


# 6c741ffb 03-May-2022 Marko Zec <zec@FreeBSD.org>

Revert "mbuf: do not restore dying interfaces"

This reverts commit 703e533da5e2e4743d38bbf4605fec041bc69976.

Revert "ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif"

This reverts commit e1882428dcbbafd2814d7e17b977a8f686784b39.

Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex


# 964b8f8b 28-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

ifnet: garbage collect unused function ifaddr_byindex().

Last use was removed in 5adea417d49.


# e1882428 26-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif

Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.

Reviewed by: kp
Differential revision: https://reviews.freebsd.org/D33266
Fun note: git show e6abef09187a


# c8f2c290 25-Jan-2022 Hans Petter Selasky <hselasky@FreeBSD.org>

Add definitions for TLS receive tags using the existing send tag infrastructure.

Although send tags are strictly used for transmit, the name might be changed
in the future to be more generic.

The TLS receive tags support regular IPv4 and IPv6 traffic, and also over any
VLAN. If prio-tagging is enabled, VLAN ID zero, this must be checked in the
network driver itself when creating the TLS RX decryption offload filter.

TLS receive tags have a modify callback to tell the network driver about
the progress of decryption. Currently decryption is done IP packet by IP
packet, even if the IP packet contains a partial TLS record. The modify
callback allows the network driver to keep track of TCP sequence numbers
pointing to the beginning of TLS records after TCP packet reassembly.
These callbacks only happen when encrypted or partially decrypted data is
received and are used to verify the decryptions starting point for the
hardware. Typically the hardware will guess where TLS headers start and
needs help from the software to know if the guess was correct. This is
the purpose of the modify callback.

Differential Revision: https://reviews.freebsd.org/D32356
Discussed with: jhb@
MFC after: 1 week
Sponsored by: NVIDIA Networking


# 7e0bba4d 04-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

ifnet: make V_if_index static to if.c

This requires moving net.link.generic sysctl declaration from if_mib.c
to if.c. Ideally if_mib.c needs just to be merged to if.c, but they
have different license texts.

Differential revision: https://reviews.freebsd.org/D33263


# d74b7bae 04-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

ifnet_byindex() actually requires network epoch

Sweep over potentially unsafe calls to ifnet_byindex() and wrap them
in epoch. Most of the code touched remains unsafe, as the returned
pointer is being used after epoch exit. Mark that with a comment.

Validate the index argument inside the function, reducing argument
validation requirement from the callers and making V_if_index
private to if.c.

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D33263


# 1e3ca25d 22-Nov-2021 Gleb Smirnoff <glebius@FreeBSD.org>

ifnet: make if_alloc_domain() static


# 8a6f38c8 22-Nov-2021 Gleb Smirnoff <glebius@FreeBSD.org>

ifnet: garbage collect drbr_*_drv().

They were left in 62d76917b8678 but after years proved not to be useful.


# c782ea8b 14-Sep-2021 John Baldwin <jhb@FreeBSD.org>

Add a switch structure for send tags.

Move the type and function pointers for operations on existing send
tags (modify, query, next, free) out of 'struct ifnet' and into a new
'struct if_snd_tag_sw'. A pointer to this structure is added to the
generic part of send tags and is initialized by m_snd_tag_init()
(which now accepts a switch structure as a new argument in place of
the type).

Previously, device driver ifnet methods switched on the type to call
type-specific functions. Now, those type-specific functions are saved
in the switch structure and invoked directly. In addition, this more
gracefully permits multiple implementations of the same tag within a
driver. In particular, NIC TLS for future Chelsio adapters will use a
different implementation than the existing NIC TLS support for T6
adapters.

Reviewed by: gallatin, hselasky, kib (older version)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31572


# 9e5243d7 30-Mar-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Enforce check for using the return result for ifa?_try_ref().

Suggested by: hps
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29504


# 25bfa448 21-Mar-2021 Adrian Chadd <adrian@FreeBSD.org>

Add device and ifnet logging methods, similar to device_printf / if_printf

* device_printf() is effectively a printf
* if_printf() is effectively a LOG_INFO

This allows subsystems to log device/netif stuff using different log levels,
rather than having to invent their own way to prefix unit/netif names.

Differential Revision: https://reviews.freebsd.org/D29320
Reviewed by: imp


# 7563019b 22-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add if_try_ref() to simplify refcount handling inside epoch.

When we have an ifp pointer and the code is running inside epoch,
epoch guarantees the pointer will not be freed.
However, the following case can still happen:

* in thread 1 we drop to refcount=0 for ifp and schedule its deletion.
* in thread 2 we use this ifp and reference it
* destroy callout kicks in
* unhappy user reports a bug

This can happen with the current implementation of ifnet_byindex_ref(),
as we're not holding any locks preventing ifnet deletion by a parallel thread.

To address it, add if_try_ref(), allowing to return failure when
referencing ifp with refcount=0.
Additionally, enforce existing if_ref() is with KASSERT to provide a
cleaner error in such scenarios.

Finally, fix ifnet_byindex_ref() by using if_try_ref() and returning NULL
if the latter fails.

MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D28836


# 600eade2 16-Feb-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add ifa_try_ref() to simplify ifa handling inside epoch.

More and more code migrates from lock-based protection to the NET_EPOCH
umbrella. It requires some logic changes, including, notably, refcount
handling.

When we have an `ifa` pointer and we're running inside epoch we're
guaranteed that this pointer will not be freed.
However, the following case can still happen:
* in thread 1 we drop to 0 refcount for ifa and schedule its deletion.
* in thread 2 we use this ifa and reference it
* destroy callout kicks in
* unhappy user reports bug

To address it, new `ifa_try_ref()` function is added, allowing to return
failure when we try to reference `ifa` with 0 refcount.
Additionally, existing `ifa_ref()` is enforced with `KASSERT` to provide
cleaner error in such scenarious.

Reviewed By: rstone, donner
Differential Revision: https://reviews.freebsd.org/D28639
MFC after: 1 week


# 1a714ff2 26-Jan-2021 Randall Stewart <rrs@FreeBSD.org>

This pulls over all the changes that are in the netflix
tree that fix the ratelimit code. There were several bugs
in tcp_ratelimit itself and we needed further work to support
the multiple tag format coming for the joint TLS and Ratelimit dances.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D28357


# a60100fd 25-Nov-2020 Kristof Provost <kp@FreeBSD.org>

if: Remove ifnet_rwlock

It no longer serves any purpose, as evidenced by the fact that we never take it
without ifnet_sxlock.

Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D27278


# 521eac97 28-Oct-2020 John Baldwin <jhb@FreeBSD.org>

Support hardware rate limiting (pacing) with TLS offload.

- Add a new send tag type for a send tag that supports both rate
limiting (packet pacing) and TLS offload (mostly similar to D22669
but adds a separate structure when allocating the new tag type).

- When allocating a send tag for TLS offload, check to see if the
connection already has a pacing rate. If so, allocate a tag that
supports both rate limiting and TLS offload rather than a plain TLS
offload tag.

- When setting an initial rate on an existing ifnet KTLS connection,
set the rate in the TCP control block inp and then reset the TLS
send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
send tag. This allocates the TLS send tag asynchronously from a
task queue, so the TLS rate limit tag alloc is always sleepable.

- When modifying a rate on a connection using KTLS, look for a TLS
send tag. If the send tag is only a plain TLS send tag, assume we
failed to allocate a TLS ratelimit tag (either during the
TCP_TXTLS_ENABLE socket option, or during the send tag reset
triggered by ktls_output_eagain) and ignore the new rate. If the
send tag is a ratelimit TLS send tag, change the rate on the TLS tag
and leave the inp tag alone.

- Lock the inp lock when setting sb_tls_info for a socket send buffer
so that the routines in tcp_ratelimit can safely dereference the
pointer without needing to grab the socket buffer lock.

- Add an IFCAP_TXTLS_RTLMT capability flag and associated
administrative controls in ifconfig(8). TLS rate limit tags are
only allocated if this capability is enabled. Note that TLS offload
(whether unlimited or rate limited) always requires IFCAP_TXTLS[46].

Reviewed by: gallatin, hselasky
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26691


# 4c7ba83f 12-Jul-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Switch inet6 default route subscription to the new rib subscription api.

Old subscription model allowed only single customer.

Switch inet6 to the new subscription api and eliminate the old model.

Differential Revision: https://reviews.freebsd.org/D25615


# 4d2c2509 23-May-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Move <add|del|change>_route() functions to route_ctl.c in preparation of
multipath control plane changed described in D24141.

Currently route.c contains core routing init/teardown functions, route table
manipulation functions and various helper functions, resulting in >2KLOC
file in total. This change moves most of the route table manipulation parts
to a dedicated file, simplifying planned multipath changes and making
route.c more manageable.

Differential Revision: https://reviews.freebsd.org/D24870


# 74787ef4 29-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Add nhop to the ifa_rtrequest() callback.

With the upcoming multipath changes described in D24141,
rt->rt_nhop can potentially point to a nexthop group instead of
an individual nhop.
To simplify caller handling of such cases, change ifa_rtrequest() callback
to pass changed nhop directly.

Differential Revision: https://reviews.freebsd.org/D24604


# 67452942 17-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Finish r191148: replace rtentry with route in if_bridge if_output() callback.

Generic if_output() callback signature was modified to use struct route
instead of struct rtentry in r191148, back in 2009.

Quoting commit message:

Change if_output to take a struct route as its fourth argument in order
to allow passing a cached struct llentry * down to L2

Fix bridge_output() to match this signature and update the remaining
comment in if_var.h.

Reviewed by: kp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24394


# 98085bae 09-Mar-2020 Andrew Gallatin <gallatin@FreeBSD.org>

make lacp's use_numa hashing aware of send tags

When I did the use_numa support, I missed the fact that there is
a separate hash function for send tag nic selection. So when
use_numa is enabled, ktls offload does not work properly, as it
does not reliably allocate a send tag on the proper egress nic
since different egress nics are selected for send-tag allocation
and packet transmit. To fix this, this change:

- refectors lacp_select_tx_port_by_hash() and
lacp_select_tx_port() to make lacp_select_tx_port_by_hash()
always called by lacp_select_tx_port()

- pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash()

- adds a numa_domain field to if_snd_tag_alloc_params

- plumbs the numa domain into places where we allocate send tags

In testing with NIC TLS setup on a NUMA machine, I see thousands
of output errors before the change when enabling
kern.ipc.tls.ifnet.permitted=1. After the change, I see no
errors, and I see the NIC sysctl counters showing active TLS
offload sessions.

Reviewed by: rrs, hselasky, jhb
Sponsored by: Netflix


# 8ad798ae 03-Mar-2020 Brooks Davis <brooks@FreeBSD.org>

Expose ifr_buffer_get_(buffer|length) outside if.c.

This is a preparatory commit for D23933.

Reviewed by: jhb


# d7313dc6 26-Feb-2020 Randall Stewart <rrs@FreeBSD.org>

This commit expands tcp_ratelimit to be able to handle cards
like the mlx-c5 and c6 that require a "setup" routine before
the tcp_ratelimit code can declare and use a rate. I add the
setup routine to if_var as well as fix tcp_ratelimit to call it.
I also revisit the rates so that in the case of a mlx card
of type c5/6 we will use about 100 rates concentrated in the range
where the most gain can be had (1-200Mbps). Note that I have
tested these on a c5 and they work and perform well. In fact
in an unloaded system they pace right to the correct rate (great
job mlx!). There will be a further commit here from Hans that
will add the respective changes to the mlx driver to support this
work (which I was testing with).

Sponsored by: Netflix Inc.
Differential Revision: ttps://reviews.freebsd.org/D23647


# 3264dcad 14-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

- Move global network epoch definition to epoch.h, as more different
subsystems tend to need to know about it, and including if_var.h is
huge header pollution for them. Polluting possible non-network
users with single symbol seems much lesser evil.
- Remove non-preemptible network epoch. Not used yet, and unlikely
to get used in close future.


# 0839aa5c 29-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

There is a long standing problem with multicast programming for NICs
and IPv6. With IPv6 we may call if_addmulti() in context of processing
of an incoming packet. Usually this is interrupt context. While most
of the NIC drivers are able to reprogram multicast filters without
sleeping, some of them can't. An example is e1000 family of drivers.
With iflib conversion the problem was somewhat hidden. Iflib processes
packets in private taskqueue, so going to sleep doesn't trigger an
assertion. However, the sleep would block operation of the driver and
following incoming packets would fill the ring and eventually would
start being dropped. Enabling epoch for the full time of a packet
processing again started to trigger assertions for e1000.

Fix this problem once and for all using a general taskqueue to call
if_ioctl() method in all cases when if_addmulti() is called in a
non sleeping context. Note that nobody cares about returned value.

Reviewed by: hselasky, kib
Differential Revision: https://reviews.freebsd.org/D22154


# 19e09f44 21-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Remove obsoleted KPIs that were used to access interface address lists.


# 7790c8c1 17-Oct-2019 Conrad Meyer <cem@FreeBSD.org>

Split out a more generic debugnet(4) from netdump(4)

Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport. It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).

It is mostly a verbatim code lift from netdump(4). Netdump(4) remains
the only consumer (until the rest of this patch series lands).

The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c. The
separation is not perfect and future improvement is welcome. Supporting
INET6 is a long-term goal.

Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.

The only functional change here is the mbuf allocation / tracking. Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time. If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.

No other functional change intended.

Reviewed by: markj (earlier version)
Some discussion with: emaste, jhb
Objection from: marius
Differential Revision: https://reviews.freebsd.org/D21421


# 270b83b9 14-Oct-2019 Hans Petter Selasky <hselasky@FreeBSD.org>

The two functions ifnet_byindex() and ifnet_byindex_locked() are exactly the
same after the network stack was epochified. Merge the two into one function
and cleanup all uses of ifnet_byindex_locked().

While at it:
- Add branch prediction macros.
- Make sure the ifnet pointer is only deferred once,
also when code optimisation is disabled.

Sponsored by: Mellanox Technologies


# fb3fc771 10-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Add two extra functions that basically give count of addresses
on interface. Such function could been implemented on top of
the if_foreach_llm?addr(), but several drivers need counting,
so avoid copy-n-paste inside the drivers.


# 826857c8 10-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Provide new KPI for network drivers to access lists of interface
addresses. The KPI doesn't reveal neither how addresses are stored,
how the access to them is synchronized, neither reveal struct ifaddr
and struct ifmaddr.

Reviewed by: gallatin, erj, hselasky, philip, stevek
Differential Revision: https://reviews.freebsd.org/D21943


# b2e60773 26-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277


# 20abea66 01-Aug-2019 Randall Stewart <rrs@FreeBSD.org>

This adds the third step in getting BBR into the tree. BBR and
an updated rack depend on having access to the new
ratelimit api in this commit.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D20953


# e2e050c8 19-May-2019 Conrad Meyer <cem@FreeBSD.org>

Extract eventfilter declarations to sys/_eventfilter.h

This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.


# 7687707d 22-Apr-2019 Andrew Gallatin <gallatin@FreeBSD.org>

Track device's NUMA domain in ifnet & alloc ifnet from NUMA local memory

This commit adds new if_alloc_domain() and if_alloc_dev() methods to
allocate ifnets. When called with a domain on a NUMA machine,
ifalloc_domain() will record the NUMA domain in the ifnet, and it will
allocate the ifnet struct from memory which is local to that NUMA
node. Similarly, if_alloc_dev() is a wrapper for if_alloc_domain
which uses a driver supplied device_t to call ifalloc_domain() with
the appropriate domain.

Note that the new if_numa_domain field fits in an alignment pad in
struct ifnet, and so does not alter the size of the structure.

Reviewed by: glebius, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19930


# b252313f 31-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

New pfil(9) KPI together with newborn pfil API and control utility.

The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.

In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.

New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.

Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.

Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.

Differential Revision: https://reviews.freebsd.org/D18951


# a68cc388 08-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanical cleanup of epoch(9) usage in network stack.

- Remove macros that covertly create epoch_tracker on thread stack. Such
macros a quite unsafe, e.g. will produce a buggy code if same macro is
used in embedded scopes. Explicitly declare epoch_tracker always.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.

This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.

Discussed with: jtl, gallatin


# b79aa45e 13-Nov-2018 Gleb Smirnoff <glebius@FreeBSD.org>

For compatibility KPI functions like if_addr_rlock() that used to have
mutexes but now are converted to epoch(9) use thread-private epoch_tracker.
Embedding tracker into ifnet(9) or ifnet derived structures creates a non
reentrable function, that will fail miserably if called simultaneously from
two different contexts.
A thread private tracker will provide a single tracker that would allow to
call these functions safely. It doesn't allow nested call, but this is not
expected from compatibility KPIs.

Reviewed by: markj


# 64d63b1e 21-Oct-2018 Andrey V. Elsukov <ae@FreeBSD.org>

Add ifaddr_event_ext event. It is similar to ifaddr_event, but the
handler receives the type of event IFADDR_EVENT_ADD/IFADDR_EVENT_DEL,
and the pointer to ifaddr. Also ifaddr_event now is implemented using
ifaddr_event_ext handler.

MFC after: 3 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17100


# 1687b1ab 29-Sep-2018 Michael Tuexen <tuexen@FreeBSD.org>

For changing the MTU on tun/tap devices, it should not matter whether it
is done via using ifconfig, which uses a SIOCSIFMTU ioctl() command, or
doing it using a TUNSIFINFO/TAPSIFINFO ioctl() command.
Without this patch, for IPv6 the new MTU is not used when creating routes.
Especially, when initiating TCP connections after increasing the MTU,
the old MTU is still used to compute the MSS.
Thanks to ae@ and bz@ for helping to improve the patch.

Reviewed by: ae@, bz@
Approved by: re (kib@)
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D17180


# b08d611d 20-Sep-2018 Matt Macy <mmacy@FreeBSD.org>

fix vlan locking to permit sx acquisition in ioctl calls

- update vlan(9) to handle changes earlier this year in multicast locking

Tested by: np@, darkfiberu at gmail.com

PR: 230510
Reviewed by: mjoras@, shurd@, sbruno@
Approved by: re (gjb@)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16808


# f9be0386 15-Aug-2018 Matt Macy <mmacy@FreeBSD.org>

Fix in6_multi double free

This is actually several different bugs:
- The code is not designed to handle inpcb deletion after interface deletion
- add reference for inpcb membership
- The multicast address has to be removed from interface lists when the refcount
goes to zero OR when the interface goes away
- decouple list disconnect from refcount (v6 only for now)
- ifmultiaddr can exist past being on interface lists
- add flag for tracking whether or not it's enqueued
- deferring freeing moptions makes the incpb cleanup code simpler but opens the
door wider still to races
- call inp_gcmoptions synchronously after dropping the the inpcb lock

Fundamentally multicast needs a rewrite - but keep applying band-aids for now.

Tested by: kp
Reported by: novel, kp, lwhsu


# 98a8fdf6 09-Jul-2018 Andrey V. Elsukov <ae@FreeBSD.org>

Deduplicate the code.

Add generic function if_tunnel_check_nesting() that does check for
allowed nesting level for tunneling interfaces and also does loop
detection. Use it in gif(4), gre(4) and me(4) interfaces.

Differential Revision: https://reviews.freebsd.org/D16162


# 6573d758 03-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

epoch(9): allow preemptible epochs to compose

- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066


# 0f8d79d9 24-May-2018 Matt Macy <mmacy@FreeBSD.org>

CK: update consumers to use CK macros across the board

r334189 changed the fields to have names distinct from those in queue.h
in order to expose the oversights as compile time errors


# 4f6c66cc 23-May-2018 Matt Macy <mmacy@FreeBSD.org>

UDP: further performance improvements on tx

Cumulative throughput while running 64
netperf -H $DUT -t UDP_STREAM -- -m 1
on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps

Single stream throughput increases from 910kpps to 1.18Mpps

Baseline:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg

- Protect read access to global ifnet list with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg

- Protect short lived ifaddr references with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg

- Convert if_afdata read lock path to epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

A fix for the inpcbhash contention is pending sufficient time
on a canary at LLNW.

Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15409


# fd04260d 20-May-2018 Matt Macy <mmacy@FreeBSD.org>

ck: simplify interface with libkvm consumers by defining ck_queue types
as their queue.h equivalents if !_KERNEL


# d7c5a620 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

ifnet: Replace if_addr_lock rwlock with epoch + mutex

Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree
4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32
4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32
4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32
4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32
4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32
4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32
4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32

After the patch

InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree
5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51
5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51
5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51
5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51
5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52
5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15366


# 70398c2f 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

epoch(9): Make epochs non-preemptible by default

There are risks associated with waiting on a preemptible epoch section.
Change the name to make them not be the default and document the issue
under CAVEATS.

Reported by: markj


# 5c30b378 10-May-2018 Matt Macy <mmacy@FreeBSD.org>

Allow different bridge types to coexist

if_bridge has a lot of limitations that make it scale poorly to higher data
rates. In my projects/VPC branch I leverage the bridge interface between
layers for my high speed soft switch as well as for purposes of stacking
in general.

Reviewed by: sbruno@
Approved by: sbruno@
Differential Revision: https://reviews.freebsd.org/D15344


# 7bf272a6 10-May-2018 Matt Macy <mmacy@FreeBSD.org>

Allocate epoch for networking at startup

Additionally add CK to include paths for modules

Approved by: sbruno@


# b6f6f880 06-May-2018 Matt Macy <mmacy@FreeBSD.org>

r333175 introduced deferred deletion of multicast addresses in order to permit the driver ioctl
to sleep on commands to the NIC when updating multicast filters. More generally this permitted
driver's to use an sx as a softc lock. Unfortunately this change introduced a race whereby a
a multicast update would still be queued for deletion when ifconfig deleted the interface
thus calling down in to _purgemaddrs and synchronously deleting _all_ of the multicast addresses
on the interface.

Synchronously remove all external references to a multicast address before enqueueing for delete.

Reported by: lwhsu
Approved by: sbruno


# e5054602 05-May-2018 Mark Johnston <markj@FreeBSD.org>

Import the netdump client code.

This is a component of a system which lets the kernel dump core to
a remote host after a panic, rather than to a local storage device.
The server component is available in the ports tree. netdump is
particularly useful on diskless systems.

The netdump(4) man page contains some details describing the protocol.
Support for configuring netdump will be added to dumpon(8) in a future
commit. To use netdump, the kernel must have been compiled with the
NETDUMP option.

The initial revision of netdump was written by Darrell Anderson and
was integrated into Sandvine's OS, from which this version was derived.

Reviewed by: bdrewery, cem (earlier versions), julian, sbruno
MFC after: 1 month
X-MFC note: use a spare field in struct ifnet
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15253


# 4a381a9e 26-Apr-2018 Hans Petter Selasky <hselasky@FreeBSD.org>

Add network device event for priority code point, PCP, changes.

When the PCP is changed for either a VLAN network interface or when
prio tagging is enabled for a regular ethernet network interface,
broadcast the IFNET_EVENT_PCP event so applications like ibcore can
update its GID tables accordingly.

MFC after: 3 days
Reviewed by: ae, kib
Differential Revision: https://reviews.freebsd.org/D15040
Sponsored by: Mellanox Technologies


# 541d96aa 30-Mar-2018 Brooks Davis <brooks@FreeBSD.org>

Use an accessor function to access ifr_data.

This fixes 32-bit compat (no ioctl command defintions are required
as struct ifreq is the same size). This is believed to be sufficent to
fully support ifconfig on 32-bit systems.

Reviewed by: kib
Obtained from: CheriBSD
MFC after: 1 week
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14900


# f1379734 27-Mar-2018 Konstantin Belousov <kib@FreeBSD.org>

Allow to specify PCP on packets not belonging to any VLAN.

According to 802.1Q-2014, VLAN tagged packets with VLAN id 0 should be
considered as untagged, and only PCP and DEI values from the VLAN tag
are meaningful. See for instance
https://www.cisco.com/c/en/us/td/docs/switches/connectedgrid/cg-switch-sw-master/software/configuration/guide/vlan0/b_vlan_0.html.

Make it possible to specify PCP value for outgoing packets on an
ethernet interface. When PCP is supplied, the tag is appended, VLAN
id set to 0, and PCP is filled by the supplied value. The code to do
VLAN tag encapsulation is refactored from the if_vlan.c and moved into
if_ethersubr.c.

Drivers might have issues with filtering VID 0 packets on
receive. This bug should be fixed for each driver.

Reviewed by: ae (previous version), hselasky, melifaro
Sponsored by: Mellanox Technologies
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D14702


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 95ed5015 06-Sep-2017 Hans Petter Selasky <hselasky@FreeBSD.org>

Add support for generic backpressure indicator for ratelimited
transmit queues aswell as non-ratelimited ones.

Add the required structure bits in order to support a backpressure
indication with ratelimited connections aswell as non-ratelimited
ones. The backpressure indicator is a value between zero and 65535
inclusivly, indicating if the destination transmit queue is empty or
full respectivly. Applications can use this value as a decision point
for when to stop transmitting data to avoid endless ENOBUFS error
codes upon transmitting an mbuf. This indicator is also useful to
reduce the latency for ratelimited queues.

Reviewed by: gallatin, kib, gnn
Differential Revision: https://reviews.freebsd.org/D11518
Sponsored by: Mellanox Technologies


# ddae5750 10-May-2017 Ravi Pokala <rpokala@FreeBSD.org>

Persistently store NIC's hardware MAC address, and add a way to retrive it

The MAC address reported by `ifconfig ${nic} ether' does not always match
the address in the hardware, as reported by the driver during attach. In
particular, NICs which are components of a lagg(4) interface all report the
same MAC.

When attaching, the NIC driver passes the MAC address it read from the
hardware as an argument to ether_ifattach(). Keep a second copy of it, and
create ioctl(SIOCGHWADDR) to return it. Teach `ifconfig' to report it along
with the active MAC address.

PR: 194386
Reviewed by: glebius
MFC after: 1 week
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D10609


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# d0b2cad1 31-Jan-2017 Stephen J. Kiernan <stevek@FreeBSD.org>

Add the folowing set accessor functions for recently-added members of ifnet
structure:

if_gethwtsomax(), if_sethwtsomax() - if_hw_tsomax
if_gethwtsomaxsegcount(), if_sethwtsomaxsegcount() - if_hw_tsomaxsegcount
if_gethwtsomaxsegsize(), if_sethwtsomaxsegsize() - if_hw_tsomaxsegsize

Update em and vnic drivers which had already been coverted to use accessor
functions for the other ifnet structure members.

Reviewed by: erj
Approved by: sjg (mentor)
Obtained from: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D8544


# 6597559e 28-Jan-2017 Dexuan Cui <dexuan@FreeBSD.org>

ifnet: move the new ifnet_event EVENTHANDLER_DECLARE to net/if_var.h

Thank glebius for pointing this out:
"The network stuff shall not be added to sys/eventhandler.h"

Reviewed by: David_A_Bright_DELL.com, sephe, glebius
Approved by: sephe (mentor)
MFC after: 2 weeks
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D9345


# f3e7afe2 18-Jan-2017 Hans Petter Selasky <hselasky@FreeBSD.org>

Implement kernel support for hardware rate limited sockets.

- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months


# b95d46da 18-Oct-2016 Kevin Lo <kevlo@FreeBSD.org>

Fix typo in comment.


# c2b5ba76 05-Oct-2016 Kevin Lo <kevlo@FreeBSD.org>

Remove an alias if_list, use if_link consistently.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D8075


# decb239d 28-Sep-2016 Kevin Lo <kevlo@FreeBSD.org>

Remove the compatibility macro if_addrlist.

Since if_addrlist is used only for ipfilter(4), add a macro if_addrlist
in ip_compat.h.

Reviewed by: cy
Differential Revision: https://reviews.freebsd.org/D8059


# c7641cd1 28-Sep-2016 Kevin Lo <kevlo@FreeBSD.org>

Remove ifa_list, use ifa_link (structure field) instead.

While here, prefer if_addrhead (FreeBSD) to if_addrlist (BSD compat) naming
for the interface address list in sctp_bsd_addr.c

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D8051


# 3ff511d3 27-Sep-2016 Kevin Lo <kevlo@FreeBSD.org>

Remove a comment about the size of the ifnet structure.

Reviewed by: adrian
Differential Revision: https://reviews.freebsd.org/D8036


# f22bfc72 23-Jun-2016 Navdeep Parhar <np@FreeBSD.org>

Add spares to struct ifnet and socket for packet pacing and/or general
use. Update comments regarding the spare fields in struct inpcb.

Bump __FreeBSD_version for the changes to the size of the structures.

Reviewed by: gnn@
Approved by: re@ (gjb@)
Sponsored by: Chelsio Communications


# ad4e9116 18-May-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than having the if_vmove() code intermixed in the vnet_destroy()
function in vnet.c move it to if.c where it logically belongs and put
it under a VNET_SYSUNINIT() call.
To not change the current behaviour make sure it runs first thing
during teardown. In the future this will allow us more flexibility
on changing the order on when we want to get rid of interfaces.

Stop exporting if_vmove() and make it file static.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6438


# 4c7070db 17-May-2016 Scott Long <scottl@FreeBSD.org>

Import the 'iflib' API library for network drivers. From the author:

"iflib is a library to eliminate the need for frequently duplicated device
independent logic propagated (poorly) across many network drivers."

Participation is purely optional. The IFLIB kernel config option is
provided for drivers that want to transition between legacy and iflib
modes of operation. ixl and ixgbe driver conversions will be committed
shortly. We hope to see participation from the Broadcom and maybe
Chelsio drivers in the near future.

Submitted by: mmacy@nextbsd.org
Reviewed by: gallatin
Differential Revision: D5211


# 4fb3a820 30-Dec-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Implement interface link header precomputation API.

Add if_requestencap() interface method which is capable of calculating
various link headers for given interface. Right now there is support
for INET/INET6/ARP llheader calculation (IFENCAP_LL type request).
Other types are planned to support more complex calculation
(L2 multipath lagg nexthops, tunnel encap nexthops, etc..).

Reshape 'struct route' to be able to pass additional data (with is length)
to prepend to mbuf.

These two changes permits routing code to pass pre-calculated nexthop data
(like L2 header for route w/gateway) down to the stack eliminating the
need for other lookups. It also brings us closer to more complex scenarios
like transparently handling MPLS nexthops and tunnel interfaces.
Last, but not least, it removes layering violation introduced by flowtable
code (ro_lle) and simplifies handling of existing if_output consumers.

ARP/ND changes:
Make arp/ndp stack pre-calculate link header upon installing/updating lle
record. Interface link address change are handled by re-calculating
headers for all lles based on if_lladdr event. After these changes,
arpresolve()/nd6_resolve() returns full pre-calculated header for
supported interfaces thus simplifying if_output().
Move these lookups to separate ether_resolve_addr() function which ether
returs error or fully-prepared link header. Add <arp|nd6_>resolve_addr()
compat versions to return link addresses instead of pre-calculated data.

BPF changes:
Raw bpf writes occupied _two_ cases: AF_UNSPEC and pseudo_AF_HDRCMPLT.
Despite the naming, both of there have ther header "complete". The only
difference is that interface source mac has to be filled by OS for
AF_UNSPEC (controlled via BIOCGHDRCMPLT). This logic has to stay inside
BPF and not pollute if_output() routines. Convert BPF to pass prepend data
via new 'struct route' mechanism. Note that it does not change
non-optimized if_output(): ro_prepend handling is purely optional.
Side note: hackish pseudo_AF_HDRCMPLT is supported for ethernet and FDDI.
It is not needed for ethernet anymore. The only remaining FDDI user is
dev/pdq mostly untouched since 2007. FDDI support was eliminated from
OpenBSD in 2013 (sys/net/if_fddisubr.c rev 1.65).

Flowtable changes:
Flowtable violates layering by saving (and not correctly managing)
rtes/lles. Instead of passing lle pointer, pass pointer to pre-calculated
header data from that lle.

Differential Revision: https://reviews.freebsd.org/D4102


# d6e82913 17-Dec-2015 Steven Hartland <smh@FreeBSD.org>

Revert r292275 & r292379

glebius has concerns about these changes so reverting those can be discussed
and addressed.

Sponsored by: Multiplay


# 52e53e2d 15-Dec-2015 Steven Hartland <smh@FreeBSD.org>

Fix lagg failover due to missing notifications

When using lagg failover mode neither Gratuitous ARP (IPv4) or Unsolicited
Neighbour Advertisements (IPv6) are sent to notify other nodes that the
address may have moved.

This results is slow failover, dropped packets and network outages for the
lagg interface when the primary link goes down.

We now use the new if_link_state_change_cond with the force param set to
allow lagg to force through link state changes and hence fire a
ifnet_link_event which are now monitored by rip and nd6.

Upon receiving these events each protocol trigger the relevant
notifications:
* inet4 => Gratuitous ARP
* inet6 => Unsolicited Neighbour Announce

This also fixes the carp IPv6 NA's that stopped working after r251584 which
added the ipv6_route__llma route.

The new behavour can be controlled using the sysctls:
* net.link.ether.inet.arp_on_link
* net.inet6.icmp6.nd6_on_link

Also removed unused param from lagg_port_state and added descriptions for the
sysctls while here.

PR: 156226
MFC after: 1 month
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D4111


# ef91a976 25-Nov-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Overhaul if_enc(4) and make it loadable in run-time.

Use hhook(9) framework to achieve ability of loading and unloading
if_enc(4) kernel module. INET and INET6 code on initialization registers
two helper hooks points in the kernel. if_enc(4) module uses these helper
hook points and registers its hooks. IPSEC code uses these hhook points
to call helper hooks implemented in if_enc(4).


# 59c180c3 16-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Unify loopback route switching:
* prepare gateway before insertion
* use RTM_CHANGE instead of explicit find/change route
* Remove fib argument from ifa_switch_loopback_route added in r264887:
if old ifp fib differes from new one, that the caller
is doing something wrong
* Make ifa_*_loopback_route call single ifa_maintain_loopback_route().


# d76d4012 14-Sep-2015 Hans Petter Selasky <hselasky@FreeBSD.org>

Update TSO limits to include all headers.

To make driver programming easier the TSO limits are changed to
reflect the values used in the BUSDMA tag a network adapter driver is
using. The TCP/IP network stack will subtract space for all linklevel
and protocol level headers and ensure that the full mbuf chain passed
to the network adapter fits within the given limits.

Implementation notes:

If a network adapter driver needs to fixup the first mbuf in order to
support VLAN tag insertion, the size of the VLAN tag should be
subtracted from the TSO limit. Else not.

Network adapters which typically inline the complete header mbuf could
technically transmit one more segment. This patch does not implement a
mechanism to recover the last segment for data transmission. It is
believed when sufficiently large mbuf clusters are used, the segment
limit will not be reached and recovering the last segment will not
have any effect.

The current TSO algorithm tries to send MTU-sized packets, where the
MTU typically is 1500 bytes, which gives 1448 bytes of TCP data
payload per packet for IPv4. That means if the TSO length limitiation
is set to 65536 bytes, there will be a data payload remainder of
(65536 - 1500) mod 1448 bytes which is equal to 324 bytes. Trying to
recover total TSO length due to inlining mbuf header data will not
have any effect, because adding or removing the ETH/IP/TCP headers
to or from 324 bytes will not cause more or less TCP payload to be
TSO'ed.

Existing network adapter limits will be updated separately.

Differential Revision: https://reviews.freebsd.org/D3458
Reviewed by: rmacklem
MFC after: 2 weeks


# 441f9243 04-Sep-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Constantify lookup key in ifa_ifwith* functions.
Some places in our network stack already have const
arguments (like if_output() routines and LLE functions).

Code using ifa_ifwith (and similar functins) along with
LLE/_output functions is currently bound to use tricks
like __DECONST(). Provide a cleaner way by making sockaddr
lookup key really constant.

MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3464


# 772e66a6 16-Apr-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Move ALTQ from contrib to net/altq. The ALTQ code is for many years
discontinued by its initial authors. In FreeBSD the code was already
slightly edited during the pf(4) SMP project. It is about to be edited
more in the projects/ifnet. Moving out of contrib also allows to remove
several hacks to the make glue.

Reviewed by: net@


# fec642ad 26-Feb-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Hide struct ifmultiaddr under _KERNEL, too.


# e072c794 19-Feb-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Now that all users of _WANT_IFADDR are fixed, remove this crutch and
hide ifaddr, in_ifaddr and in6_ifaddr under _KERNEL.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 73d77028 23-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Do more fine-grained lltable locking: use table runtime lock as rare
as we can.


# 7c066c18 22-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Use less-invasive approach for IF_AFDATA lock: convert into 2 locks:
use rwlock accessible via external functions
(IF_AFDATA_CFG_* -> if_afdata_cfg_*()) for all control plane tasks
use rmlock (IF_AFDATA_RUN_*) for fast-path lookups.


# 27688dfe 22-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Temporarily revert r274774.


# 9883e41b 20-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Switch IF_AFDATA lock to rmlock


# 3c7c188c 10-Nov-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Fix some minor TSO issues:
- Improve description of TSO limits.
- Remove a not needed KASSERT()
- Remove some not needed variable casts.

Sponsored by: Mellanox Technologies
Discussed with: lstewart @
MFC after: 1 week


# 033074c4 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Replace 'struct route *' if_output() argument with 'struct nhop_info *'.
Leave 'struct route' as is for legacy routing api users.
Remove most of rtalloc_ign*-derived functions.


# 4ea05db8 09-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Use standard mtx(9), rwlock(9), sx(9) system initialization macros
instead of doing initialization manually.

Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 1a75e3b2 06-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Make checks for rt_mtu generic:

Some virtual if drivers has (ab)used ifa ifa_rtrequest hook to enforce
route MTU to be not bigger that interface MTU. While ifa_rtrequest hooking
might be an option in some situation, it is not feasible to do MTU checks
there: generic (or per-domain) routing code is perfectly capable of doing
this.

We currrently have 3 places where MTU is altered:

1) route addition.
In this case domain overrides radix _addroute callback (in[6]_addroute)
and all necessary checks/fixes are/can be done there.

2) route change (especially, GW change).
In this case, there are no explicit per-domain calls, but one can
override rte by setting ifa_rtrequest hook to domain handler
(inet6 does this).

3) ifconfig ifaceX mtu YYYY
In this case, we have no callbacks, but ip[6]_output performes runtime
checks and decreases rt_mtu if necessary.

Generally, the goals are to be able to handle all MTU changes in
control plane, not in runtime part, and properly deal with increased
interface MTU.

This commit changes the following:
* removes hooks setting MTU from drivers side
* adds proper per-doman MTU checks for case 1)
* adds generic MTU check for case 2)

* The latter is done by using new dom_ifmtu callback since
if_mtu denotes L3 interface MTU, e.g. maximum trasmitted _packet_ size.
However, IPv6 mtu might be different from if_mtu one (e.g. default 1280)
for some cases, so we need an abstract way to know maximum MTU size
for given interface and domain.
* moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies
user-supplied data which must be checked.
* removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to
use this functions on new non-inserted rte.

More changes will follow soon.

MFC after: 1 month
Sponsored by: Yandex LLC


# f8ca6199 03-Nov-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Clarify TSO segment limit comment and remove two TABs to make lines a
bit shorter.

Sponsored by: Mellanox Technologies


# cbaac009 28-Sep-2014 Bjoern A. Zeeb <bz@FreeBSD.org>

Move the unconditional #include of net/ifq.h to the very end of file.
This seems to allow us to pass a universe with either clang or gcc
after r272244 (and r272260) and probably makes it easier to untabgle
these chained #includes in the future.


# bd071d4d 28-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove empty wrappers ether_poll_[de]register_drv(). [1]
- Move polling(9) declarations out of ifq.h back to if_var.h
they are absolutely unrelated to queues.

Submitted by: Mikhail <mp lenta.ru> [1]


# 112f50ff 28-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Finally, convert counters in struct ifnet to counter(9).

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 7d6cc45c 27-Sep-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Use underlying ports counters to get lagg statistics instead of
per-packet accounting.
This introduce user-visible changes like aggregating error counters.

Reviewed by: asomers (prev.version), glebius
CR: D781
MFC after: 2 weeks
Sponsored by: Yandex LLC


# 9fd573c3 22-Sep-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Improve transmit sending offload, TSO, algorithm in general.

The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

Reviewed by: adrian, rmacklem
Sponsored by: Mellanox Technologies
MFC after: 1 week


# d2a707cd 18-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove a bunch of methods that are superseded by if_inc_counter().

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 1b7fb1d9 18-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

While not too late rename 'ifnet_counter' to 'ift_counter'. One of the
imporant moments that we discussed with Marcel and Anuranjan was that
a converted driver should return false for 'grep ifnet if_driver.c' :)

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 35853c2c 18-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Add a function to set if_get_counter method for an ifnet. To be used
in the drivers that are already converted to "Juniper drvapi". This
can be revisited in future.


# 277e067a 18-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

While not too late rename if_get_counter_compat() to if_get_counter_default().
The compat counters will go away, but the function will remain in its place,
and in all places where it is going to be called.

Discussed with: melifaro


# 0b7b006c 18-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Add if_inc_counter(), a generic method to update ifnet(9) counter
w/o dereferencing the struct.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 72f31000 13-Sep-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Revert r271504. A new patch to solve this issue will be made.

Suggested by: adrian @


# eb93b77a 13-Sep-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Improve transmit sending offload, TSO, algorithm in general.

The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

MFC after: 1 week
Sponsored by: Mellanox Technologies


# 4f8585e0 11-Sep-2014 Alan Somers <asomers@FreeBSD.org>

Revisions 264905 and 266860 added a "int fib" argument to ifa_ifwithnet and
ifa_ifwithdstaddr. For the sake of backwards compatibility, the new
arguments were added to new functions named ifa_ifwithnet_fib and
ifa_ifwithdstaddr_fib, while the old functions became wrappers around the
new ones that passed RT_ALL_FIBS for the fib argument. However, the
backwards compatibility is not desired for FreeBSD 11, because there are
numerous other incompatible changes to the ifnet(9) API. We therefore
decided to remove it from head but leave it in place for stable/9 and
stable/10. In addition, this commit adds the fib argument to
ifa_ifwithbroadaddr for consistency's sake.

sys/sys/param.h
Increment __FreeBSD_version

sys/net/if.c
sys/net/if_var.h
sys/net/route.c
Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib
versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute.

sys/net/route.c
sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_options.c
sys/netinet/ip_output.c
sys/netinet6/nd6.c
Fixup calls of modified functions.

share/man/man9/ifnet.9
Document changed API.

CR: https://reviews.freebsd.org/D458
MFC after: Never
Sponsored by: Spectra Logic


# ccbefc2d 31-Aug-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Toss fields so that no padding field is required to achieve alignment.


# 09a8241f 30-Aug-2014 Gleb Smirnoff <glebius@FreeBSD.org>

It is actually possible to have if_t a typedef to non-void type,
and keep both converted to drvapi and non-converted drivers
compilable.

o Make if_t typedef to struct ifnet *.
o Remove shim functions.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 997d2d83 31-Aug-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Provide pointer from struct ifnet to struct netmap_adapter,
instead of abusing spare field.


# e6485f73 31-Aug-2014 Gleb Smirnoff <glebius@FreeBSD.org>

o Remove struct if_data from struct ifnet. Now it is merely API structure
for route(4) socket and ifmib(4) sysctl.
o Move fields from if_data to ifnet, but keep all statistic counters
separate, since they should disappear later.
o Provide function if_data_copy() to fill if_data, utilize it in routing
socket and ifmib handler.
o Provide overridable ifnet(9) method to fetch counters. If no provided,
if_get_counters_compat() would be used, that returns old counters.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 9753faf5 29-Jul-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Garbage collect couple of unused fields from struct ifaddr:
- ifa_claim_addr() unused since removal of NetAtalk
- ifa_metric seems to be never utilized, always a copy of if_metric


# 62d76917 02-Jun-2014 Marcel Moolenaar <marcel@FreeBSD.org>

Introduce a procedural interface to the ifnet structure. The new
interface allows the ifnet structure to be defined as an opaque
type in NIC drivers. This then allows the ifnet structure to be
changed without a need to change or recompile NIC drivers.

Put differently, NIC drivers can be written and compiled once and
be used with different network stack implementations, provided of
course that those network stack implementations have an API and
ABI compatible interface.

This commit introduces the 'if_t' type to replace 'struct ifnet *'
as the type of a network interface. The 'if_t' type is defined as
'void *' to enable the compiler to perform type conversion to
'struct ifnet *' and vice versa where needed and without warnings.
The functions that implement the API are the only functions that
need to have an explicit cast.

The MII code has been converted to use the driver API to avoid
unnecessary code churn. Code churn comes from having to work with
both converted and unconverted drivers in correlation with having
callback functions that take an interface. By converting the MII
code first, the callback functions can be defined so that the
compiler will perform the typecasts automatically.

As soon as all drivers have been converted, the if_t type can be
redefined as needed and the API functions can be fix to not need
an explicit cast.

The immediate benefactors of this change are:
1. Juniper Networks - The network stack implementation in Junos
is entirely different from FreeBSD's one and this change
allows Juniper to build "stock" NIC drivers that can be used
in combination with both the FreeBSD and Junos stacks.
2. FreeBSD - This change opens the door towards changing ifnet
and implementing new features and optimizations in the network
stack without it requiring a change in the many NIC drivers
FreeBSD has.

Submitted by: Anuranjan Shukla <anshukla@juniper.net>
Reviewed by: glebius@
Obtained from: Juniper Networks, Inc.


# 2f308a34 29-May-2014 Alan Somers <asomers@FreeBSD.org>

Fix unintended KBI change from r264905. Add _fib versions of
ifa_ifwithnet() and ifa_ifwithdstaddr() The legacy functions will call the
_fib() versions with RT_ALL_FIBS, preserving legacy behavior.

sys/net/if_var.h
sys/net/if.c
Add legacy-compatible functions as described above. Ensure legacy
behavior when RT_ALL_FIBS is passed as fibnum.

sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/net/route.c
sys/net/rtsock.c
sys/netinet6/nd6.c
Call with _fib() functions if we must use a specific fib, or the
legacy functions otherwise.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
Improve the udp_dontroute test. The bug that this test exercises is
that ifa_ifwithnet() will return the wrong address, if multiple
interfaces have addresses on the same subnet but with different
fibs. The previous version of the test only considered one possible
failure mode: that ifa_ifwithnet_fib() might fail to find any
suitable address at all. The new version also checks whether
ifa_ifwithnet_fib() finds the correct address by checking where the
ARP request goes.

Reported by: bz, hrs
Reviewed by: hrs
MFC after: 1 week
X-MFC-with: 264905
Sponsored by: Spectra Logic


# 0cfee0c2 24-Apr-2014 Alan Somers <asomers@FreeBSD.org>

Fix subnet and default routes on different FIBs on the same subnet.

These two bugs are closely related. The root cause is that ifa_ifwithnet
does not consider FIBs when searching for an interface address.

sys/net/if_var.h
sys/net/if.c
Add a fib argument to ifa_ifwithnet and ifa_ifwithdstadddr. Those
functions will only return an address whose interface fib equals the
argument.

sys/net/route.c
Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib
arguments.

sys/netinet/in.c
Update in_addprefix to consider the interface fib when adding
prefixes. This will prevent it from not adding a subnet route when
one already exists on a different fib.

sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/netinet6/nd6.c
Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet.
In some cases it there wasn't a clear specific fib number to use.
In others, I was unable to test those functions so I chose
RT_DEFAULT_FIB to minimize divergence from current behavior. I will
fix some of the latter changes along with PR kern/187553.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
tests/sys/netinet/Makefile
Revert r263738. The udp_dontroute test was right all along.
However, bugs kern/187550 and kern/187553 cancelled each other out
when it came to this test. Because of kern/187553, ifa_ifwithnet
searched the default fib instead of the requested one, but because
of kern/187550, there was an applicable subnet route on the default
fib. The new test added in r263738 doesn't work right, however. I
can verify with dtrace that ifa_ifwithnet returned the wrong address
before I applied this commit, but route(8) miraculously found the
correct interface to use anyway. I don't know how.

Clear expected failure messages for kern/187550 and kern/187552.

PR: kern/187550
PR: kern/187552
Reviewed by: melifaro
MFC after: 3 weeks
Sponsored by: Spectra Logic


# 0489b891 24-Apr-2014 Alan Somers <asomers@FreeBSD.org>

Fix host and network routes for new interfaces when net.add_addr_allfibs=0

sys/net/route.c
In rtinit1, use the interface fib instead of the process fib. The
latter wasn't very useful because ifconfig(8) is usually invoked
with the default process fib. Changing ifconfig(8) to use setfib(2)
would be redundant, because it already sets the interface fib.

tests/sys/netinet/fibs_test.sh
Clear the expected ATF failure

sys/net/if.c
Pass the interface fib in calls to rtrequest1_fib and rtalloc1_fib

sys/netinet/in.c
sys/net/if_var.h
Add a fibnum argument to ifa_switch_loopback_route, a subroutine of
in_scrubprefix. Pass it the interface fib.

PR: kern/187549
Reviewed by: melifaro
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation


# 61b9f217 19-Mar-2014 Navdeep Parhar <np@FreeBSD.org>

Add a shorter alias for if_data.ifi_oqdrops.


# b245f96c 12-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Since 32-bit if_baudrate isn't enough to describe a baud rate of a 10 Gbit
interface, in the r241616 a crutch was provided. It didn't work well, and
finally we decided that it is time to break ABI and simply make if_baudrate
a 64-bit value. Meanwhile, the entire struct if_data was reviewed.

o Remove the if_baudrate_pf crutch.

o Make all fields of struct if_data fixed machine independent size. The
notion of data (packet counters, etc) are by no means MD. And it is a
bug that on amd64 we've got a 64-bit counters, while on i386 32-bit,
which at modern speeds overflow within a second.

This also removes quite a lot of COMPAT_FREEBSD32 code.

o Give 16 bit for the ifi_datalen field. This field was provided to
make future changes to if_data less ABI breaking. Unfortunately the
8 bit size of it had effectively limited sizeof if_data to 256 bytes.

o Give 32 bits to ifi_mtu and ifi_metric.
o Give 64 bits to the rest of fields, since they are counters.

__FreeBSD_version bumped.

Discussed with: emax
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 9a6356bc 05-Nov-2013 Gleb Smirnoff <glebius@FreeBSD.org>

In complemence to ifa_add_loopback_route() and ifa_del_loopback_route()
provide function ifa_switch_loopback_route() that will be used in case when
an interface address used for a loopback route goes away, but we have another
interface address with same address value and want to preserve loopback
route.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# b1b9dcae 05-Nov-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Remove net.link.ether.inet.useloopback sysctl tunable. It was always on by
default from the very beginning. It was placed in wrong namespace
net.link.ether, originally it had been at another wrong namespace. It was
incorrectly documented at incorrect manual page arp(8). Since new-ARP commit,
the tunable have been consulted only on route addition, and ignored on route
deletion. Behaviour of a system with tunable turned off is not fully correct,
and has no advantages comparing to normal behavior.


# 5b74cfe4 31-Oct-2013 Andre Oppermann <andre@FreeBSD.org>

Make struct ifnet readable and comprehensible again by grouping
and ordering related variables, fields and locks next to each
other. Add more comments to variables.

Over time 'ifnet' has accumlated a lot of additional pointers and
functionality in an unstructured way making it quite hard to read
and understand while obfuscating relationships between fields and
variables.

Quantify the structure size and how bloated it has become.

This is only a mechanical change in preparation for upcoming
work to make ifnet opaque to drivers and to separate out the
interface queuing.

Sponsored by: The FreeBSD Foundation


# ded7d20f 29-Oct-2013 Andre Oppermann <andre@FreeBSD.org>

Move all interface queue related structures, macros and definitions
from net/if_var to it own new net/ifq.h.

For now net/ifq.h is unconditionally included through net/if_var.h.

This is a mechanical change in preparation to make struct ifnet and
the individual interface queue mechanisms opaque.

Discussed with: glebius
Sponsored by: The FreeBSD Foundation


# eaeb0c13 28-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Style: s/SYS_EVENTHANDLER_H/_SYS_EVENTHANDLER_H_/g

Submitted by: bde


# c29e1ad9 28-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Make the prophecy from 1997 happen and remove if_var.h inclusion
from if.h.
- Remove unnecessary includes and declarations from if.h
- Remove unnecessary includes and declarations from if_var.h [1]
- Mark some declarations that are about to be removed in near
future with comments, explaning why this declaration is still
necessary.
- Protect eventhandler declarations with #ifdef SYS_EVENTHANDLER_H.

Obtained from: bdeBSD [1]
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 7caf4ab7 15-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Utilize counter(9) to accumulate statistics on interface addresses. Add
four counters to struct ifaddr. This kills '+=' on a variables shared
between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
took if_data not from the ifaddr, but from ifaddr's ifnet. [2]

Submitted by: melifaro [1], pluknet[2]
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 3fffa8c8 15-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Push some defines under _KERNEL, improve styling and comments.


# 67420bda 15-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Remove ifa_mtx. It was used only in one place in kernel, and ifnet's
ifaddr lock can substitute it there.

Discussed with: melifaro, ae
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 46758960 15-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Remove ifa_init() and provide ifa_alloc() that will allocate and setup
struct ifaddr internally.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 6ed910fa 15-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Hide 'struct ifaddr' definition from userland. Two tools left that use it,
namely ipftest(1) and ifmcstat(1). These sniff structure definition using
_WANT_IFADDR define.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# d36ed80a 05-Jul-2013 Colin Percival <cperciva@FreeBSD.org>

Fix typo: minmum -> minimum.

Submitted by: @z3ndrag0n


# 3c914c54 02-Jun-2013 Andre Oppermann <andre@FreeBSD.org>

Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore. The upper limit is still at
IP_MAXPACKET (65536 bytes). Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.

The placement into "struct ifnet" is a bit hackish but the best place
that was found. When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by: cperciva (earlier version)
Reviewed by: cperciva
Tested by: cperciva
MFC after: 1 week (using spare struct members to preserve ABI)


# f89d4c3a 06-May-2013 Andre Oppermann <andre@FreeBSD.org>

Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.


# 47e8d432 25-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Add const qualifier to the dst parameter of the ifnet if_output method.


# 18ba072a 10-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Fix build.


# e8b3186b 09-Apr-2013 Andre Oppermann <andre@FreeBSD.org>

Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.


# 24421c1c 11-Feb-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Resolve source address selection in presense of CARP. Add a couple
of helper functions:

- carp_master() - boolean function which is true if an address
is in the MASTER state.
- ifa_preferred() - boolean function that compares two addresses,
and is aware of CARP.

Utilize ifa_preferred() in ifa_ifwithnet().

The previous version of patch also changed source address selection
logic in jails using carp_master(), but we failed to negotiate this part
with Bjoern. May be we will approach this problem again later.

Reported & tested by: Anton Yuzhaninov <citrin citrin.ru>
Sponsored by: Nginx, Inc


# ded5ea6a 07-Feb-2013 Randall Stewart <rrs@FreeBSD.org>

This fixes a out-of-order problem with several
of the newer drivers. The basic problem was
that the driver was pulling the mbuf off the
drbr ring and then when sending with xmit(), encounting
a full transmit ring. Thus the lower layer
xmit() function would return an error, and the
drivers would then append the data back on to the ring.
For TCP this is a horrible scenario sure to bring
on a fast-retransmit.

The fix is to use drbr_peek() to pull the data pointer
but not remove it from the ring. If it fails then
we either call the new drbr_putback or drbr_advance
method. Advance moves it forward (we do this sometimes
when the xmit() function frees the mbuf). When
we succeed we always call advance. The
putback will always copy the mbuf back to the top
of the ring. Note that the putback *cannot* be used
with a drbr_dequeue() only with drbr_peek(). We most
of the time, in putback, would not need to copy it
back since most likey the mbuf is still the same, but
sometimes xmit() functions will change the mbuf via
a pullup or other call. So the optimial case for
the single consumer is to always copy it back. If
we ever do a multiple_consumer (for lagg?) we
will need a test and atomic in the put back possibly
a seperate putback_mc() in the ring buf.

Reviewed by: jhb@freebsd.org, jlv@freebsd.org


# c9b652e3 18-Oct-2012 Andre Oppermann <andre@FreeBSD.org>

Mechanically remove the last stray remains of spl* calls from net*/*.
They have been Noop's for a long time now.


# 608ae712 17-Oct-2012 Maksim Yevmenkin <emax@FreeBSD.org>

provide helper if_initbaudrate() to set if_baudrate_pf and if_baudrate_pf.
again, use ixgbe(4) as an example of how to use new helper function.

Reviewed by: jhb
MFC after: 1 week


# 0fef97fe 16-Oct-2012 Maksim Yevmenkin <emax@FreeBSD.org>

introduce concept of ifi_baudrate power factor. the idea is to work
around the problem where high speed interfaces (such as ixgbe(4))
are not able to report real ifi_baudrate. bascially, take a spare
byte from struct if_data and use it to store ifi_baudrate power
factor. in other words,

real ifi_baudrate = ifi_baudrate * 10 ^ ifi_baudrate power factor

this should be backwards compatible with old binaries. use ixgbe(4)
as an example on how drivers would set ifi_baudrate power factor

Discussed with: kib, scottl, glebius
MFC after: 1 week


# 063efed2 28-Sep-2012 Gleb Smirnoff <glebius@FreeBSD.org>

The drbr(9) API appeared to be so unclear, that most drivers in
tree used it incorrectly, which lead to inaccurate overrated
if_obytes accounting. The drbr(9) used to update ifnet stats on
drbr_enqueue(), which is not accurate since enqueuing doesn't
imply successful processing by driver. Dequeuing neither mean
that. Most drivers also called drbr_stats_update() which did
accounting again, leading to doubled if_obytes statistics. And
in case of severe transmitting, when a packet could be several
times enqueued and dequeued it could have been accounted several
times.

o Thus, make drbr(9) API thinner. Now drbr(9) merely chooses between
ALTQ queueing or buf_ring(9) queueing.
- It doesn't touch the buf_ring stats any more.
- It doesn't touch ifnet stats anymore.
- drbr_stats_update() no longer exists.

o buf_ring(9) handles its stats itself:
- It handles br_drops itself.
- br_prod_bytes stats are dropped. Rationale: no one ever
reads them but update of a common counter on every packet
negatively affects performance due to excessive cache
invalidation.
- buf_ring_enqueue_bytes() reduced to buf_ring_enqueue(), since
we no longer account bytes.

o Drivers handle their stats theirselves: if_obytes, if_omcasts.

o mlx4(4), igb(4), em(4), vxge(4), oce(4) and ixv(4) no longer
use drbr_stats_update(), and update ifnet stats theirselves.

o bxe(4) was the most correct driver, it didn't call
drbr_stats_update(), thus it was the only driver accurate under
moderate load. Now it also maintains stats itself.

o ixgbe(4) had already taken stats from hardware, so just
- drop software stats updating.
- take multicast packet count from hardware as well.

o mxge(4) just no longer needs NO_SLOW_STATS define.

o cxgb(4), cxgbe(4) need no change, since they obtain stats
from hardware.

Reviewed by: jfv, gnn


# 73c23f3b 04-Sep-2012 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix the build broken by r240099.
Hide link_pfil_hook under _KERNEL macro.

MFC after: 3 weeks


# 7d4317bd 04-Sep-2012 Alexander V. Chernikov <melifaro@FreeBSD.org>

Introduce new link-layer PFIL hook V_link_pfil_hook.
Merge ether_ipfw_chk() and part of bridge_pfil() into
unified ipfw_check_frame() function called by PFIL.
This change was suggested by rwatson? @ DevSummit.

Remove ipfw headers from ether/bridge code since they are unneeded now.

Note this thange introduce some (temporary) performance penalty since
PFIL read lock has to be acquired for every link-level packet.

MFC after: 3 weeks


# ea537929 02-Aug-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Fix races between in_lltable_prefix_free(), lla_lookup(),
llentry_free() and arptimer():

o Use callout_init_rw() for lle timeout, this allows us safely
disestablish them.
- This allows us to simplify the arptimer() and make it
race safe.
o Consistently use ifp->if_afdata_lock to lock access to
linked lists in the lle hashes.
o Introduce new lle flag LLE_LINKED, which marks an entry that
is attached to the hash.
- Use LLE_LINKED to avoid double unlinking via consequent
calls to llentry_free().
- Mark lle with LLE_DELETED via |= operation istead of =,
so that other flags won't be lost.
o Make LLE_ADDREF(), LLE_REMREF() and LLE_FREE_LOCKED() more
consistent and provide more informative KASSERTs.

The patch is a collaborative work of all submitters and myself.

PR: kern/165863
Submitted by: Andrey Zonov <andrey zonov.org>
Submitted by: Ryan Stone <rysto32 gmail.com>
Submitted by: Eric van Gyzen <eric_van_gyzen dell.com>


# 09fe6320 19-Jun-2012 Navdeep Parhar <np@FreeBSD.org>

- Updated TOE support in the kernel.

- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)


# 02ed02af 19-Mar-2012 John Baldwin <jhb@FreeBSD.org>

Retire the IF_ADDR_LOCK() and IF_ADDR_UNLOCK() compat macros from HEAD.
The new [RW]LOCK macros are merged back to 8.x so should be suitable for
new code in HEAD even if it is to be MFC'd.


# 4ecf274b 08-Feb-2012 Sergey Kandaurov <pluknet@FreeBSD.org>

g/c last bit of old ipv6 prefix management.

Reviewed by: bz
Obtained from: NetBSD, net/if.h, rev 1.80


# fbcebf7f 09-Jan-2012 John Baldwin <jhb@FreeBSD.org>

Convert the per-interface address list lock from a mutex to a reader/writer
lock.

Reviewed by: bz


# a2cb1d52 05-Jan-2012 John Baldwin <jhb@FreeBSD.org>

Add new variants of the IF_ADDR_*LOCK*() macros used for protecting
interface address lists that distinguish read locks from write locks.
To preserve the KPI, the previous operations are mapped to the write
lock macros. The lock is still kept as a mutex for now.

Reviewed by: bz
MFC after: 2 weeks


# 08b68b0e 15-Dec-2011 Gleb Smirnoff <glebius@FreeBSD.org>

A major overhaul of the CARP implementation. The ip_carp.c was started
from scratch, copying needed functionality from the old implemenation
on demand, with a thorough review of all code. The main change is that
interface layer has been removed from the CARP. Now redundant addresses
are configured exactly on the interfaces, they run on.

The CARP configuration itself is, as before, configured and read via
SIOCSVH/SIOCGVH ioctls. A new prefix created with SIOCAIFADDR or
SIOCAIFADDR_IN6 may now be configured to a particular virtual host id,
which makes the prefix redundant.

ifconfig(8) semantics has been changed too: now one doesn't need
to clone carpXX interface, he/she should directly configure a vhid
on a Ethernet interface.

To supply vhid data from the kernel to an application the getifaddrs(8)
function had been changed to pass ifam_data with each address. [1]

The new implementation definitely closes all PRs related to carp(4)
being an interface, and may close several others. It also allows
to run a single redundant IP per interface.

Big thanks to Bjoern Zeeb for his help with inet6 part of patch, for
idea on using ifam_data and for several rounds of reviewing!

PR: kern/117000, kern/126945, kern/126714, kern/120130, kern/117448
Reviewed by: bz
Submitted by: bz [1]


# f26fa169 09-Dec-2011 Brooks Davis <brooks@FreeBSD.org>

Remove the unused if_free_type() function.

X-MFC after: never


# a0af7c3e 27-Oct-2011 Gleb Smirnoff <glebius@FreeBSD.org>

Add macro IF_DEQUEUE_ALL(ifq, m), that takes the entire mbuf chain off
the queue. It can be utilized in queue processing to avoid multiple
locking/unlocking.


# d9a36286 17-Jul-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Add spares to the network stack for FreeBSD-9:
- TCP keep* timers
- TCP UTO (adjust from what was there already)
- netmap
- route caching
- user cookie (temporary to allow for the real fix)

Slightly re-shuffle struct ifnet moving fields out of the middle
of spares and to better align.

Discussed with: rwatson (slightly earlier version)


# 43deddcd 03-Jul-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Remove extra white space to comply with style for the rest of the struct.

MFC after: 2 weeks


# 35fd7bc0 02-Jul-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Add infrastructure to allow all frames/packets received on an interface
to be assigned to a non-default FIB instance.

You may need to recompile world or ports due to the change of struct ifnet.

Submitted by: cjsp
Submitted by: Alexander V. Chernikov (melifaro ipfw.ru)
(original versions)
Reviewed by: julian
Reviewed by: Alexander V. Chernikov (melifaro ipfw.ru)
MFC after: 2 weeks
X-MFC: use spare in struct ifnet


# e4cd31dd 21-Mar-2011 Jeff Roberson <jeff@FreeBSD.org>

- Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
and other miscellaneous small features.


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# dd62f5c0 25-Jun-2010 Qing Li <qingli@FreeBSD.org>

MFC r208553

This patch fixes the problem where proxy ARP entries cannot be added
over the if_ng interface.

Approved by: re (bz)


# 0ed6142b 25-May-2010 Qing Li <qingli@FreeBSD.org>

This patch fixes the problem where proxy ARP entries cannot be added
over the if_ng interface.

MFC after: 3 days


# 7c61d493 24-May-2010 Andrew Thompson <thompsa@FreeBSD.org>

MFC r202588

Declare a new EVENTHANDLER called iflladdr_event which signals that the L2
address on an interface has changed. This lets stacked interfaces such as
vlan(4) detect that their lower interface has changed and adjust things in
order to keep working. Previously this situation broke at least vlan(4) and
lagg(4) configurations.

The EVENTHANDLER_INVOKE call was not placed within if_setlladdr() due to the
risk of a loop.

PR: kern/142927
Submitted by: Nikolay Denev

MFC r202611

Do not hold the lock over if_setlladdr() as it calls into the interface driver
init routine.


# 29f2c008 18-Mar-2010 Max Laier <mlaier@FreeBSD.org>

MFC r203834 and r205197: Make ALTQ work for drbr consumers.


# 4c71aa58 15-Mar-2010 Max Laier <mlaier@FreeBSD.org>

Fix a small bug in drbr_dequeue_cond spotted while preparing MFC of r203834.

MFC after: 3 days


# a5a931b3 25-Feb-2010 Xin LI <delphij@FreeBSD.org>

MFC 203052:

Add interface description capability as inspired by OpenBSD. Thanks for
rwatson@, jhb@, brooks@ and others for feedback to the old implementation!

Sponsored by: iXsystems, Inc.


# 193cbc4d 13-Feb-2010 Max Laier <mlaier@FreeBSD.org>

Fix drbr and altq interaction:
- introduce drbr_needs_enqueue that returns whether the interface/br needs
an enqueue operation: returns true if altq is enabled or there are
already packets in the ring (as we need to maintain packet order)
- update all drbr consumers
- fix drbr_flush
- avoid using the driver queue (IFQ_DRV_*) in the altq case as the
multiqueue consumer does not provide enough protection, serialize altq
interaction with the main queue lock
- make drbr_dequeue_cond work with altq

Discussed with: kmacy, yongari, jfv
MFC after: 4 weeks


# 3ddba633 31-Jan-2010 Shteryana Shopova <syrinx@FreeBSD.org>

MFC r202935:
While flushing the multicast filter of an interface, do not zero the relevant
ifmultiaddr structures' reference to the parent interface, unless the parent
interface is really detaching. While here, program only link layer multicast
filters to a wlan's hardware parent interface.

PR: kern/142391, kern/142392
Reviewed by: sam, rpaulo, bms


# 215940b3 26-Jan-2010 Xin LI <delphij@FreeBSD.org>

Revised revision 199201 (add interface description capability as inspired
by OpenBSD), based on comments from many, including rwatson, jhb, brooks
and others.

Sponsored by: iXsystems, Inc.
MFC after: 1 month


# 93ec7edc 24-Jan-2010 Shteryana Shopova <syrinx@FreeBSD.org>

While flushing the multicast filter of an interface, do not zero the relevant
ifmultiaddr structures' reference to the parent interface, unless the parent
interface is really detaching. While here, program only link layer multicast
filters to a wlan's hardware parent interface.

PR: kern/142391, kern/142392
Reviewed by: sam, rpaolo, bms
MFC after: 1 week


# ea4ca115 18-Jan-2010 Andrew Thompson <thompsa@FreeBSD.org>

Declare a new EVENTHANDLER called iflladdr_event which signals that the L2
address on an interface has changed. This lets stacked interfaces such as
vlan(4) detect that their lower interface has changed and adjust things in
order to keep working. Previously this situation broke at least vlan(4) and
lagg(4) configurations.

The EVENTHANDLER_INVOKE call was not placed within if_setlladdr() due to the
risk of a loop.

PR: kern/142927
Submitted by: Nikolay Denev


# 130fd3bc 05-Jan-2010 Qing Li <qingli@FreeBSD.org>

MFC r201319

Remove a deleted comment line that was brought back by
my previous commit.


# 32c53401 05-Jan-2010 Qing Li <qingli@FreeBSD.org>

MFC r201282, r201543

r201282
-------
The proxy arp entries could not be added into the system over the
IFF_POINTOPOINT link types. The reason was due to the routing
entry returned from the kernel covering the remote end is of an
interface type that does not support ARP. This patch fixes this
problem by providing a hint to the kernel routing code, which
indicates the prefix route instead of the PPP host route should
be returned to the caller. Since a host route to the local end
point is also added into the routing table, and there could be
multiple such instantiations due to multiple PPP links can be
created with the same local end IP address, this patch also fixes
the loopback route installation failure problem observed prior to
this patch. The reference count of loopback route to local end would
be either incremented or decremented. The first instantiation would
create the entry and the last removal would delete the route entry.

r201543
-------
The IFA_RTSELF address flag marks a loopback route has been installed
for the interface address. This marker is necessary to properly support
PPP types of links where multiple links can have the same local end
IP address. The IFA_RTSELF flag bit maps to the RTF_HOST value, which
was combined into the route flag bits during prefix installation in
IPv6. This inclusion causing the prefix route to be unusable. This
patch fixes this bug by excluding the IFA_RTSELF flag during route
installation.

PR: ports/141342, kern/141134


# 9f140905 30-Dec-2009 Qing Li <qingli@FreeBSD.org>

Remove a deleted comment line that was brought back by
my previous commit.

MFC after: 5 days


# c7ab6602 30-Dec-2009 Qing Li <qingli@FreeBSD.org>

The proxy arp entries could not be added into the system over the
IFF_POINTOPOINT link types. The reason was due to the routing
entry returned from the kernel covering the remote end is of an
interface type that does not support ARP. This patch fixes this
problem by providing a hint to the kernel routing code, which
indicates the prefix route instead of the PPP host route should
be returned to the caller. Since a host route to the local end
point is also added into the routing table, and there could be
multiple such instantiations due to multiple PPP links can be
created with the same local end IP address, this patch also fixes
the loopback route installation failure problem observed prior to
this patch. The reference count of loopback route to local end would
be either incremented or decremented. The first instantiation would
create the entry and the last removal would delete the route entry.

MFC after: 5 days


# 8e968376 21-Dec-2009 John Baldwin <jhb@FreeBSD.org>

Remove commented out prototype for ifinit(). This prototype has been
commented out since 1.1 and has not been present in <sys/systm.h> since at
least 1.1 of that file. It is also not needed in FreeBSD due to SYSINIT().


# 34605f85 30-Nov-2009 John Baldwin <jhb@FreeBSD.org>

Remove if_timer/if_watchdog now that they are no longer used. The space
used by if_timer is reserved for expanding if_index to an int in the
future.

Reviewed by: rwatson, brooks


# 1a9d4dda 12-Nov-2009 Xin LI <delphij@FreeBSD.org>

Revert revision 199201 for now as it has introduced a kernel vulnerability
and requires more polishing.


# 41c8c6e8 11-Nov-2009 Xin LI <delphij@FreeBSD.org>

Add interface description capability as inspired by OpenBSD.

MFC after: 3 months


# 553a7dec 15-Sep-2009 Qing Li <qingli@FreeBSD.org>

MFC r197227

Self pointing routes are installed for configured interface addresses
and address aliases. After an interface is brought down and brought
back up again, those self pointing routes disappeared. This patch
ensures after an interface is brought back up, the loopback routes
are reinstalled properly.

Reviewed by: bz
Approved by: re


# 9bb7d0f4 15-Sep-2009 Qing Li <qingli@FreeBSD.org>

Self pointing routes are installed for configured interface addresses
and address aliases. After an interface is brought down and brought
back up again, those self pointing routes disappeared. This patch
ensures after an interface is brought back up, the loopback routes
are reinstalled properly.

Reviewed by: bz
MFC after: immediately


# b569420a 28-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge r196510 from head to stable/8:

Make if_grow static -- it's not used outside of if.c, and with the
internals destined to change, it's better if it remains that way.

Approved by: re (kib)


# 3ef94f2b 28-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge r196481 from head to stable/8:

Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock. Either can be held to stablize the lists and indexes, but both
are required to write. This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions. As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by: bz, julian

Approved by: re (kib)


# 8e937462 23-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Make if_grow static -- it's not used outside of if.c, and with the
internals destined to change, it's better if it remains that way.

MFC after: 3 days


# 77dfcdc4 23-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock. Either can be held to stablize the lists and indexes, but both
are required to write. This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions. As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by: bz, julian
MFC after: 3 days


# aeba9e80 20-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge r196263 from head to stable/8:

Remove unused if_rawoutput() macro; it has been unused since at least
FreeBSD 2.

Approved by: re (kib)


# d931ea09 15-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Remove unused if_rawoutput() macro; it has been unused since at least
FreeBSD 2.

Approved by: re (kib)


# df813b7e 27-Jul-2009 Qing Li <qingli@FreeBSD.org>

This patch does the following:

- Allow loopback route to be installed for address assigned to
interface of IFF_POINTOPOINT type.
- Install loopback route for an IPv4 interface addreess when the
"useloopback" sysctl variable is enabled. Similarly, install
loopback route for an IPv6 interface address when the sysctl variable
"nd6_useloopback" is enabled. Deleting loopback routes for interface
addresses is unconditional in case these sysctl variables were
disabled after an interface address has been assigned.

Reviewed by: bz
Approved by: re


# 1e77c105 16-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used. Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with: bz, julian
Reviewed by: bz
Approved by: re (kensmith, kib)


# eddfbb76 14-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 6cb7f168 29-Jun-2009 Brooks Davis <brooks@FreeBSD.org>

Remove support for the /dev/net/* per-interface devices. They serve
little purpose and are unused in the base system.

The IOCTL functionality is entirely duplicated and routing sockets
provide a richer interface than the kqueue functionality.

Further, it is not practical for these devices to be made sensible in
the face of VIMAGE.

Bump __FreeBSD_version on the off chance that there is any code out
there that actually uses this stuff.

Reviewed by: rwatson
Discussed with: bz, zec
Approved by: re@ (kensmith)


# f9ef96ca 25-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Define four wrapper functions for interface address locking,
if_addr_rlock() and if_addr_runlock() for regular address lists, and
if_maddr_rlock() and if_maddr_runlock() for multicast address lists.

We will use these in various kernel modules to avoid encoding specific
type and locking strategy information into modules that currently use
IF_ADDR_LOCK() and IF_ADDR_UNLOCK() directly.

MFC after: 6 weeks


# 8896f83a 22-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Add a new function, ifa_ifwithaddr_check(), which rather than returning
a pointer to an ifaddr matching the passed socket address, returns a
boolean indicating whether one was present. In the (near) future,
ifa_ifwithaddr() will return a referenced ifaddr rather than a raw
ifaddr pointer, and the new wrapper will allow callers that care only
about the boolean condition to avoid having to free that reference.

MFC after: 3 weeks


# 1099f828 21-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Clean up common ifaddr management:

- Unify reference count and lock initialization in a single function,
ifa_init().
- Move tear-down from a macro (IFAFREE) to a function ifa_free().
- Move reference count bump from a macro (IFAREF) to a function ifa_ref().
- Instead of using a u_int protected by a mutex to refcount(9) for
reference count management.

The ifa_mtx is now used for exactly one ioctl, and possibly should be
removed.

MFC after: 3 weeks


# d49cd9a1 19-Jun-2009 Kip Macy <kmacy@FreeBSD.org>

add helper function for flushing software queues


# d659538f 15-Jun-2009 Sam Leffler <sam@FreeBSD.org>

r193336 moved ifq_detach to if_free which broke if_alloc followed
by if_free (w/o doing if_attach); move ifq_attach to if_alloc and
rename ifq_attach/detach to ifq_init/ifq_delete to better identify
their purpose

Reviewed by: jhb, kmacy


# a913be09 09-Jun-2009 Kip Macy <kmacy@FreeBSD.org>

- add drbr routines for accessing #qentries and conditionally dequeueing
- track bytes enqueued in buf_ring


# bc29160d 08-Jun-2009 Marko Zec <zec@FreeBSD.org>

Introduce an infrastructure for dismantling vnet instances.

Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)


# 1abcdbd1 30-May-2009 Attilio Rao <attilio@FreeBSD.org>

When user_frac in the polling subsystem is low it is going to busy the
CPU for too long period than necessary. Additively, interfaces are kept
polled (in the tick) even if no more packets are available.
In order to avoid such situations a new generic mechanism can be
implemented in proactive way, keeping track of the time spent on any
packet and fragmenting the time for any tick, stopping the processing
as soon as possible.

In order to implement such mechanism, the polling handler needs to
change, returning the number of packets processed.
While the intended logic is not part of this patch, the polling KPI is
broken by this commit, adding an int return value and the new flag
IFCAP_POLLING_NOCOUNT (which will signal that the return value is
meaningless for the installed handler and checking should be skipped).

Bump __FreeBSD_version in order to signal such situation.

Reviewed by: emaste
Sponsored by: Sandvine Incorporated


# e0c14af9 22-May-2009 Marko Zec <zec@FreeBSD.org>

Introduce the if_vmove() function, which will be used in the future
for reassigning ifnets from one vnet to another.

if_vmove() works by calling a restricted subset of actions normally
executed by if_detach() on an ifnet in the current vnet, and then
switches to the target vnet and executes an appropriate subset of
if_attach() actions there.

if_attach() and if_detach() have become wrapper functions around
if_attach_internal() and if_detach_internal(), where the later
variants have an additional argument, a flag indicating whether a
full attach or detach sequence is to be executed, or only a
restricted subset suitable for moving an ifnet from one vnet to
another. Hence, if_vmove() will not call if_detach() and if_attach()
directly, but will call the if_detach_internal() and
if_attach_internal() variants instead, with the vmove flag set.

While here, staticize ifnet_setbyindex() since it is not referenced
from outside of sys/net/if.c.

Also rename ifccnt field in struct vimage to ifcnt, and do some minor
whitespace garbage collection where appropriate.

This change should have no functional impact on nooptions VIMAGE kernel
builds.

Reviewed by: bz, rwatson, brooks?
Approved by: julian (mentor)


# 21ca7b57 05-May-2009 Marko Zec <zec@FreeBSD.org>

Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by: julian (mentor)


# f6dfe47a 30-Apr-2009 Marko Zec <zec@FreeBSD.org>

Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:

options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

INIT_VNET_NET(ifp->if_vnet); becomes

struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by: bz, rwatson
Approved by: julian (mentor)


# 6064c5d3 23-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Add ifunit_ref(), a version of ifunit(), that returns not just an
interface pointer, but also a reference to it.

Modify ifioctl() to use ifunit_ref(), holding the reference until
all ioctls, etc, have completed.

This closes a class of reader-writer races in which interfaces
could be removed during long-running ioctls, leading to crashes.
Many other consumers of ifunit() should now use ifunit_ref() to
avoid similar races.

MFC after: 3 weeks


# 111c6b61 23-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

During if_detach(), invoke if_dead() to set the ifnet's function
pointers to "dead" implementations that no-op rather than invoking
the device driver. This would generally be unexpected and
possibly quite badly handled by most device drivers after
if_detach() has completed.

Reviewed by: bms
MFC after: 3 weeks


# 27d37320 21-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Start to address a number of races relating to use of ifnet pointers
after the corresponding interface has been destroyed:

(1) Add an ifnet refcount, ifp->if_refcount. Initialize it to 1 in
if_alloc(), and modify if_free_type() to decrement and check the
refcount.

(2) Add new if_ref() and if_rele() interfaces to allow kernel code
walking global interface lists to release IFNET_[RW]LOCK() yet
keep the ifnet stable. Currently, if_rele() is a no-op wrapper
around if_free(), but this may change in the future.

(3) Add new ifnet field, if_alloctype, which caches the type passed
to if_alloc(), but unlike if_type, won't be changed by drivers.
This allows asynchronous free's of the interface after the
driver has released it to still use the right type. Use that
instead of the type passed to if_free_type(), but assert that
they are the same (might have to rethink this if that doesn't
work out).

(4) Add a new ifnet_byindex_ref(), which looks up an interface by
index and returns a reference rather than a pointer to it.

(5) Fix if_alloc() to fully initialize the if_addr_mtx before hooking
up the ifnet to global lists.

(6) Modify sysctls in if_mib.c to use ifnet_byindex_ref() and release
the ifnet when done.

When this change is MFC'd, it will need to replace if_ispare fields
rather than adding new fields in order to avoid breaking the binary
interface. Once this change is MFC'd, if_free_type() should be
removed, as its 'type' argument is now optional.

This refcount is not appropriate for counting mbuf pkthdr references,
and also not for counting entry into the device driver via ifnet
function pointers. An rmlock may be appropriate for the latter.
Rather, this is about ensuring data structure stability when reaching
an ifnet via global ifnet lists and tables followed by copy in or out
of userspace.

MFC after: 3 weeks
Reported by: mdtancsa
Reviewed by: brooks


# 7cc5b47f 16-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

export if_qflush for use by driver if_qflush routines
only set ifp->if_{transmit, qflush} if not already set
KASSERT that neither or both are set


# 279aa3d4 16-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

Change if_output to take a struct route as its fourth argument in order
to allow passing a cached struct llentry * down to L2

Reviewed by: rwatson


# 3efea337 13-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

Adapt buf_ring abstraction interface to allow consumers to interoperate with ALTQ


# e5adda3d 15-Mar-2009 Robert Watson <rwatson@FreeBSD.org>

Remove IFF_NEEDSGIANT, a compatibility infrastructure introduced
in FreeBSD 5.x to allow network device drivers to run with Giant
despite the network stack being Giant-free. This significantly
simplifies calls into ioctl() on network interfaces, especially
in the multicast code, as well as eliminates deferred invocation
of interface if_start routines.

Disable the build on device drivers still depending on
IFF_NEEDSGIANT as they no longer compile. They will be removed
in a few weeks if they haven't been made MPSAFE in that time.
Disabled drivers:

if_ar
if_axe
if_aue
if_cdce
if_cue
if_kue
if_ray
if_rue
if_rum
if_sr
if_udav
if_ural
if_zyd

Drivers that were already disabled because of tty changes:

if_ppp
if_sl

Discussed on: arch@


# 3055e123 28-Feb-2009 Robert Watson <rwatson@FreeBSD.org>

Do a bit of struct ifnet cleanup in preparation for 8.0: group function
pointers together, move padding to the bottom of the structure, and add
two new integer spares due to attrition over time. Remove unused spare
"flags" field, we can use one of the spare ints if we need it later.

This change requires a rebuild of device driver modules that depend on
the layout of ifnet for binary compatibility reasons.

Discussed with: kmacy


# 64c44e5d 17-Dec-2008 Kip Macy <kmacy@FreeBSD.org>

Keep stats in drbr_enqueue

Discussed with: ps


# 1635d917 16-Dec-2008 Kip Macy <kmacy@FreeBSD.org>

merge in 2 buf_ring helper routines for enqueueing and freeing buf_rings


# 991f8615 16-Dec-2008 Kip Macy <kmacy@FreeBSD.org>

convert ifnet and afdata locks from mutexes to rwlocks


# 6e6b3f7c 14-Dec-2008 Qing Li <qingli@FreeBSD.org>

This main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,

The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
me maintaining that branch before the svn conversion


# 1b193af6 13-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Second round of putting global variables, which were virtualized
but formerly missed under VIMAGE_GLOBAL.

Put the extern declarations of the virtualized globals
under VIMAGE_GLOBAL as the globals themsevles are already.
This will help by the time when we are going to remove the globals
entirely.

Sponsored by: The FreeBSD Foundation


# 4b79449e 02-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# db7f0b97 21-Nov-2008 Kip Macy <kmacy@FreeBSD.org>

- bump __FreeBSD version to reflect added buf_ring, memory barriers,
and ifnet functions

- add memory barriers to <machine/atomic.h>
- update drivers to only conditionally define their own

- add lockless producer / consumer ring buffer
- remove ring buffer implementation from cxgb and update its callers

- add if_transmit(struct ifnet *ifp, struct mbuf *m) to ifnet to
allow drivers to efficiently manage multiple hardware queues
(i.e. not serialize all packets through one ifq)
- expose if_qflush to allow drivers to flush any driver managed queues

This work was supported by Bitgravity Inc. and Chelsio Inc.


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 8b615593 02-Oct-2008 Marko Zec <zec@FreeBSD.org>

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 516993d4 19-Aug-2008 Andrew Thompson <thompsa@FreeBSD.org>

ifnet_setbyindex() is only used locally, go back to being static.


# 1887d35f 19-Aug-2008 Kip Macy <kmacy@FreeBSD.org>

Fix build


# 02f4879d 26-Jun-2008 Robert Watson <rwatson@FreeBSD.org>

Introduce locking around use of ifindex_table, whose use was previously
unsynchronized. While races were extremely rare, we've now had a
couple of reports of panics in environments involving large numbers of
IPSEC tunnels being added very quickly on an active system.

- Add accessor functions ifnet_byindex(), ifaddr_byindex(),
ifdev_byindex() to replace existing accessor macros. These functions
now acquire the ifnet lock before derefencing the table.
- Add IFNET_WLOCK_ASSERT().
- Add static accessor functions ifnet_setbyindex(), ifdev_setbyindex(),
which set values in the table either asserting of acquiring the ifnet
lock.
- Use accessor functions throughout if.c to modify and read
ifindex_table.
- Rework ifnet attach/detach to lock around ifindex_table modification.

Note that these changes simply close races around use of ifindex_table,
and make no attempt to solve the probem of disappearing ifnets. Further
refinement of this work, including with respect to ifindex_table
resizing, is still required.

In a future change, the ifnet lock should be converted from a mutex to an
rwlock in order to reduce contention.

Reviewed and tested by: brooks


# 8b07e49a 09-May-2008 Julian Elischer <julian@FreeBSD.org>

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# fb27dd1d 25-Mar-2008 Sam Leffler <sam@FreeBSD.org>

expose if_purgemaddrs, it will be used by the vap code unless someone
redesigns the mcast support code in the next few weeks

MFC after: 3 weeks


# 2de2af32 06-Dec-2007 Kip Macy <kmacy@FreeBSD.org>

Add padding for anticipated functionality
- vimage
- TOE
- multiq
- host rtentry caching

Rename spare used by 80211 to if_llsoftc

Reviewed by: rwatson, gnn
MFC after: 1 day


# bec59525 16-May-2007 Brooks Davis <brooks@FreeBSD.org>

The struct if_data members ifi_recvquota and ifi_xmitquota have been
unused for ages. Rename them to ifi_spare_char1 and ifi_spare_char2
respectively to indicate this face.


# 18242d3b 16-Apr-2007 Andrew Thompson <thompsa@FreeBSD.org>

Rename the trunk(4) driver to lagg(4) as it is too similar to vlan trunking.

The name trunk is misused as the networking term trunk means carrying multiple
VLANs over a single connection. The IEEE standard for link aggregation (802.3
section 3) does not talk about 'trunk' at all while it is used throughout IEEE
802.1Q in describing vlans.

The lagg(4) driver provides link aggregation, failover and fault tolerance.

Discussed on: current@


# b47888ce 09-Apr-2007 Andrew Thompson <thompsa@FreeBSD.org>

Add the trunk(4) driver for providing link aggregation, failover and fault
tolerance. This driver allows aggregation of multiple network interfaces as
one virtual interface using a number of different protocols/algorithms.

failover - Sends traffic through the secondary port if the master becomes
inactive.
fec - Supports Cisco Fast EtherChannel.
lacp - Supports the IEEE 802.3ad Link Aggregation Control Protocol
(LACP) and the Marker Protocol.
loadbalance - Static loadbalancing using an outgoing hash.
roundrobin - Distributes outgoing traffic using a round-robin scheduler
through all active ports.

This code was obtained from OpenBSD and this also includes 802.3ad LACP support
from agr(4) in NetBSD.


# 5896d124 19-Mar-2007 Bruce M Simpson <bms@FreeBSD.org>

Fix tinderbox; ng_ether needs to see if_findmulti().


# ec002fee 19-Mar-2007 Bruce M Simpson <bms@FreeBSD.org>

Implement reference counting for ifmultiaddr, in_multi, and in6_multi
structures. Detect when ifnet instances are detached from the network
stack and perform appropriate cleanup to prevent memory leaks.

This has been implemented in such a way as to be backwards ABI compatible.
Kernel consumers are changed to use if_delmulti_ifma(); in_delmulti()
is unable to detect interface removal by design, as it performs searches
on structures which are removed with the interface.

With this architectural change, the panics FreeBSD users have experienced
with carp and pfsync should be resolved.

Obtained from: p4 branch bms_netdev
Reviewed by: andre
Sponsored by: Garance A Drosehn
Idea from: NetBSD
MFC after: 1 month


# 60d4ab7a 06-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

Improve description of if_capabilities, if_capenable and ifi_hwassist.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 773725a2 06-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

Fix the socket option IP_ONESBCAST by giving it its own case in ip_output()
and skip over the normal IP processing.

Add a supporting function ifa_ifwithbroadaddr() to verify and validate the
supplied subnet broadcast address.

PR: kern/99558
Tested by: Andrey V. Elsukov <bu7cher-at-yandex.ru>
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


# 43bc7a9c 04-Aug-2006 Brooks Davis <brooks@FreeBSD.org>

With exception of the if_name() macro, all definitions in net_osdep.h
were unused or already in if_var.h so add if_name() to if_var.h and
remove net_osdep.h along with all references to it.

Longer term we may want to kill off if_name() entierly since all modern
BSDs have if_xname variables rendering it unnecessicary.


# 0dad3f0e 19-Jun-2006 Max Laier <mlaier@FreeBSD.org>

Import interface groups from OpenBSD. This allows to group interfaces in
order to - for example - apply firewall rules to a whole group of
interfaces. This is required for importing pf from OpenBSD 3.9

Obtained from: OpenBSD (with changes)
Discussed on: -net (back in April)


# 75ee267c 30-Jan-2006 Gleb Smirnoff <glebius@FreeBSD.org>

Merge the //depot/user/yar/vlan branch into CVS. It contains some collective
work by yar, thompsa and myself. The checksum offloading part also involves
work done by Mihail Balikov.

The most important changes:

o Instead of global linked list of all vlan softc use a per-trunk
hash. The size of hash is dynamically adjusted, depending on
number of entries. This changes struct ifnet, replacing counter
of vlans with a pointer to trunk structure. This change is an
improvement for setups with big number of VLANs, several interfaces
and several CPUs. It is a small regression for a setup with a single
VLAN interface.
An alternative to dynamic hash is a per-trunk static array with
4096 entries, which is a compile time option - VLAN_ARRAY. In my
experiments the array is not an improvement, probably because such
a big trunk structure doesn't fit into CPU cache.
o Introduce an UMA zone for VLAN tags. Since drivers depend on it,
the zone is declared in kern_mbuf.c, not in optional vlan(4) driver.
This change is a big improvement for any setup utilizing vlan(4).
o Use rwlock(9) instead of mutex(9) for locking. We are the first
ones to do this! :)
o Some drivers can do hardware VLAN tagging + hardware checksum
offloading. Add an infrastructure for this. Whenever vlan(4) is
attached to a parent or parent configuration is changed, the flags
on vlan(4) interface are updated.

In collaboration with: yar, thompsa
In collaboration with: Mihail Balikov <mihail.balikov interbgc.com>


# 4a0d6638 11-Nov-2005 Ruslan Ermilov <ru@FreeBSD.org>

- Store pointer to the link-level address right in "struct ifnet"
rather than in ifindex_table[]; all (except one) accesses are
through ifp anyway. IF_LLADDR() works faster, and all (except
one) ifaddr_byindex() users were converted to use ifp->if_addr.

- Stop storing a (pointer to) Ethernet address in "struct arpcom",
and drop the IFP2ENADDR() macro; all users have been converted
to use IF_LLADDR() instead.


# 4e7e0183 08-Nov-2005 Andrew Thompson <thompsa@FreeBSD.org>

Move the cloned interface list management in to if_clone. For some drivers the
softc lists and associated mutex are now unused so these have been removed.

Calling if_clone_detach() will now destroy all the cloned interfaces for the
driver and in most cases is all thats needed to unload.

Idea by: brooks
Reviewed by: brooks


# 40929967 01-Oct-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Big polling(4) cleanup.

o Axe poll in trap.

o Axe IFF_POLLING flag from if_flags.

o Rework revision 1.21 (Giant removal), in such a way that
poll_mtx is not dropped during call to polling handler.
This fixes problem with idle polling.

o Make registration and deregistration from polling in a
functional way, insted of next tick/interrupt.

o Obsolete kern.polling.enable. Polling is turned on/off
with ifconfig.

Detailed kern_poll.c changes:
- Remove polling handler flags, introduced in 1.21. The are not
needed now.
- Forget and do not check if_flags, if_capenable and if_drv_flags.
- Call all registered polling handlers unconditionally.
- Do not drop poll_mtx, when entering polling handlers.
- In ether_poll() NET_LOCK_GIANT prior to locking poll_mtx.
- In netisr_poll() axe the block, where polling code asks drivers
to unregister.
- In netisr_poll() and ether_poll() do polling always, if any
handlers are present.
- In ether_poll_[de]register() remove a lot of error hiding code. Assert
that arguments are correct, instead.
- In ether_poll_[de]register() use standard return values in case of
error or success.
- Introduce poll_switch() that is a sysctl handler for kern.polling.enable.
poll_switch() goes through interface list and enabled/disables polling.
A message that kern.polling.enable is deprecated is printed.

Detailed driver changes:
- On attach driver announces IFCAP_POLLING in if_capabilities, but
not in if_capenable.
- On detach driver calls ether_poll_deregister() if polling is enabled.
- In polling handler driver obtains its lock and checks IFF_DRV_RUNNING
flag. If there is no, then unlocks and returns.
- In ioctl handler driver checks for IFCAP_POLLING flag requested to
be set or cleared. Driver first calls ether_poll_[de]register(), then
obtains driver lock and [dis/en]ables interrupts.
- In interrupt handler driver checks IFCAP_POLLING flag in if_capenable.
If present, then returns.This is important to protect from spurious
interrupts.

Reviewed by: ru, sam, jhb


# 292ee7be 09-Aug-2005 Robert Watson <rwatson@FreeBSD.org>

Rename IFF_RUNNING to IFF_DRV_RUNNING, IFF_OACTIVE to IFF_DRV_OACTIVE,
and move both flags from ifnet.if_flags to ifnet.if_drv_flags, making
and documenting the locking of these flags the responsibility of the
device driver, not the network stack. The flags for these two fields
will be mutually exclusive so that they can be exposed to user space as
though they were stored in the same variable.

Provide #defines to provide the old names #ifndef _KERNEL, so that user
applications (such as ifconfig) can use the old flag names. Using the
old names in a device driver will result in a compile error in order to
help device driver writers adopt the new model.

When exposing the interface flags to user space, via interface ioctls
or routing sockets, or the two fields together. Since the driver flags
cannot currently be set for user space, no new logic is currently
required to handle this case.

Add some assertions that general purpose network stack routines, such
as if_setflags(), are not improperly used on driver-owned flags.

With this change, a large number of very minor network stack races are
closed, subject to correct device driver locking. Most were likely
never triggered.

Driver sweep to follow; many thanks to pjd and bz for the line-by-line
review they gave this patch.

Reviewed by: pjd, bz
MFC after: 7 days


# c3b31afd 02-Aug-2005 Robert Watson <rwatson@FreeBSD.org>

Protect link layer network interface multicast address list manipulation
using ifp->if_addr_mtx:

- Initialize if_addr_mtx when ifnet is initialized.

- Destroy if_addr_mtx when ifnet is torn down.

- Rename ifmaof_ifpforaddr() to if_findmulti(); assert if_addr_mtx.
Staticize.

- Extract ifmultiaddr allocation and initialization into if_allocmulti();
accept a 'mflags' argument to indicate whether or not sleeping is
permitted. This centralizes error handling and address duplication.

- Extract ifmultiaddr tear-down and deallocation in if_freemulti().

- Re-structure if_addmulti() to hold if_addr_mtx around manipulation of
the ifnet multicast address list and reference count manipulation.
Make use of non-sleeping allocations. Annotate the fact that we only
generate routing socket events for explicit address addition, not
implicit link layer address addition.

- Re-structure if_delmulti() to hold if_addr_mtx around manipulation of
the ifnet multicast address list and reference count manipulation.
Annotate the lack of a routing socket event for implicit link layer
address removal.

- De-spl all and sundry.

Problem reported by: Ed Maste <emaste at phaedrus dot sandvine dot ca>
MFC after: 1 week


# de6073aa 02-Aug-2005 Robert Watson <rwatson@FreeBSD.org>

Add if_addr_mtx to struct ifnet, a mutex to protect ifnet-related address
lists. Add accessor macros.

This changes the size of struct ifnet, but ideally, all ifnet consumers
are now using if_alloc() to allocate these structures rather than
embedding them into device driver softc's, so this won't modify the
network device driver ABI.

MFC after: 1 week


# 638ccea0 21-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Allocate one of the spare ifnet integer fields to hold if_drv_flags,
which in the future will hold IFF_OACTIVE and IFF_RUNNING, and have
its access synchronized by the device driver rather than the
protocol stack. This will avoid potential races in the management
of flags in if_flags.

Discussed with: various (scottl, jhb, ...)
MFC after: 1 week


# fc74a9f9 10-Jun-2005 Brooks Davis <brooks@FreeBSD.org>

Stop embedding struct ifnet at the top of driver softcs. Instead the
struct ifnet or the layer 2 common structure it was embedded in have
been replaced with a struct ifnet pointer to be filled by a call to the
new function, if_alloc(). The layer 2 common structure is also allocated
via if_alloc() based on the interface type. It is hung off the new
struct ifnet member, if_l2com.

This change removes the size of these structures from the kernel ABI and
will allow us to better manage them as interfaces come and go.

Other changes of note:
- Struct arpcom is no longer referenced in normal interface code.
Instead the Ethernet address is accessed via the IFP2ENADDR() macro.
To enforce this ac_enaddr has been renamed to _ac_enaddr.
- The second argument to ether_ifattach is now always the mac address
from driver private storage rather than sometimes being ac_enaddr.

Reviewed by: sobomax, sam


# 8f867517 04-Jun-2005 Andrew Thompson <thompsa@FreeBSD.org>

Add hooks into the networking layer to support if_bridge. This changes struct
ifnet so a buildworld is necessary.

Approved by: mlaier (mentor)
Obtained from: NetBSD


# 45778b37 25-May-2005 Peter Edwards <peadar@FreeBSD.org>

Separate out address-detaching part of if_detach into if_purgeaddrs,
so if_tap doesn't need to rely on locally-rolled code to do same.

The observable symptom of if_tap's bzero'ing the address details
was a crash in "ifconfig tap0" after an if_tap device was closed.

Reported By: Matti Saarinen (mjsaarin at cc dot helsinki dot fi)


# 68a3482f 20-Apr-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Do not call all link state callbacks directly, but schedule
a taskqueue(9) task. This fixes LORs and adds possibility
to serve such events pseudorecursively, when link state
change of interface causes subsequent change on other
interfaces.

Sponsored by: Rambler
Reviewed by: sam, brooks, mux


# 3a84d72a 01-Mar-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Revert change to struct ifnet. Use ifnet pointer in softc. Embedding
ifnet into smth will soon be removed.

Requested by: brooks


# e8c34a71 26-Feb-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Remove carp_softc.sc_ifp member in favor of union pointers in struct ifnet.

Obtained from: OpenBSD


# a9771948 22-Feb-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Add CARP (Common Address Redundancy Protocol), which allows multiple
hosts to share an IP address, providing high availability and load
balancing.

Original work on CARP done by Michael Shalayeff, with many
additions by Marco Pfatschbacher and Ryan McBride.

FreeBSD port done solely by Max Laier.

Patch by: mlaier
Obtained from: OpenBSD (mickey, mcbride)


# c398230b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 94f5c9cf 07-Dec-2004 Sam Leffler <sam@FreeBSD.org>

Cleanup link state change notification:
o add new if_link_state_change routine that deals with link state changes
o change mii to use if_link_state_change


# 0b39ef4d 09-Nov-2004 Max Laier <mlaier@FreeBSD.org>

Remove the #if 0 wrapping around !ALTQ stuff that can't be used due to ABI
stability anyway.


# 0b762445 30-Oct-2004 Robert Watson <rwatson@FreeBSD.org>

Move if_handoff() from an inline in if_var.h to a function to if.c
in orden to harden the ABI for 5.x; this will permit us to modify
the locking in the ifnet packet dispatch without requiring drivers
to be recompiled.

MFC after: 3 days
Discussed at: EuroBSDCon Developer's Summit


# b4d4574a 30-Oct-2004 Robert Watson <rwatson@FreeBSD.org>

Add additional "spare" fields to 'struct ifnet' in order to improve
the resistance of the network driver ABI to changes that will be
required as we optimize locking.

MFC after: 3 days
Discussed at: Developer Summit


# 2f27e151 25-Oct-2004 John-Mark Gurney <jmg@FreeBSD.org>

use NULL instead of 0 when casting/comparing w/ a pointer...


# 31302ebf 19-Oct-2004 Robert Watson <rwatson@FreeBSD.org>

Define IFF_LOCKGIANT() and IFF_UNLOCKGIANT() macros, which conditionally
acquire Giant if the passed interface has IFF_NEEDSGIANT set on it.
Modify calls into (ifp)->if_ioctl() in if.c to use these macros in order
to ensure that Giant is held.

MFC after: 3 days
Bumped into by: jmg


# ad3b9257 15-Aug-2004 John-Mark Gurney <jmg@FreeBSD.org>

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# de0332d4 07-Aug-2004 Max Laier <mlaier@FreeBSD.org>

Add a "void *if_carp" placeholder to struct ifnet with prospect to bring in
the "Common address redundancy protocol" (CARP) during the 5-STABLE cycle.
Hence doing the ABI break now.

Approved by: re (scottl)


# af5e59bf 27-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Add a new network interface flag, IFF_NEEDSGIANT, which will allow
device drivers to declare that the ifp->if_start() method implemented
by the driver requires Giant in order to operate correctly.

Add a 'struct task' to 'struct ifnet' that can be used to execute a
deferred ifp->if_start() in the event that if_start needs to be called
in a Giant-free environment. To do this, introduce if_start(), a
wrapper function for ifp->if_start(). If the interface can run MPSAFE,
it directly dispatches into the interface start routine. If it can't
run MPSAFE, we're running with debug.mpsafenet != 0, and Giant isn't
currently held, the task is queued to execute in a swi holding Giant
via if_start_deferred().

Modify if_handoff() to use if_start() instead of direct dispatch.
Modify 802.11 to use if_start() instead of direct dispatch.

This is intended to provide increased compatibility for non-MPSAFE
network device drivers in the presence of Giant-free operation via
asynchronous dispatch. However, this commit does not mark any network
interfaces as IFF_NEEDSGIANT.


# bfe46415 14-Jul-2004 Max Laier <mlaier@FreeBSD.org>

Fix a copy-and-paste-o in IFQ_DRV_PREPEND - all pointyhats to me.
While here also fix a (not less stupid) braino in IFQ_DRV_PURGE.

Reported-by: clement
Tested-by: clement (_PREPEND in sis(4))


# f889d2ef 22-Jun-2004 Brooks Davis <brooks@FreeBSD.org>

Major overhaul of pseudo-interface cloning. Highlights include:

- Split the code out into if_clone.[ch].
- Locked struct if_clone. [1]
- Add a per-cloner match function rather then simply matching names of
the form <name><unit> and <name>.
- Use the match function to allow creation of <interface>.<tag>
vlan interfaces. The old way is preserved unchanged!
- Also the match function to allow creation of stf(4) interfaces named
stf0, stf, or 6to4. This is the only major user visible change in
that "ifconfig stf" creates the interface stf rather then stf0 and
does not print "stf0" to stdout.
- Allow destroy functions to fail so they can refuse to delete
interfaces. Currently, we forbid the deletion of interfaces which
were created in the init function, particularly lo0, pflog0, and
pfsync0. In the case of lo0 this was a panic implementation so it
does not count as a user visiable change. :-)
- Since most interfaces do not need the new functionality, an family of
wrapper functions, ifc_simple_*(), were created to wrap old style
cloner functions.
- The IF_CLONE_INITIALIZER macro is replaced with a new incompatible
IFC_CLONE_INITIALIZER and ifc_simple consumers use IFC_SIMPLE_DECLARE
instead.

Submitted by: Maurycy Pawlowski-Wieronski <maurycy at fouk.org> [1]
Reviewed by: andre, mlaier
Discussed on: net


# 89c9c53d 16-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.


# 62d7f46e 14-Jun-2004 Max Laier <mlaier@FreeBSD.org>

Fix a typeo in IFQ_HANDOFF.


# 4cb655c0 14-Jun-2004 Max Laier <mlaier@FreeBSD.org>

Transform tbr_dequeue into a function pointer in order to build drivers with
ALTQ enabled versions of IFQ_* macros by default, as requested by serveral
others. This is a follow-up to the quick fix I committed yesterday which
turned off the ALTQ checks for non-ALTQ kernels.


# 930e2cfa 13-Jun-2004 Max Laier <mlaier@FreeBSD.org>

Unbreak non-ALTQ kernel linking. I forgot about tbr_dequeue.

In the end drivers should be building with ALTQ checks by default, but for
now build them with the old macros for non-ALTQ kernels.

Note: Check new features w/ LINT *and* w/ LINT minus the new feature.

Found-by: rwatson


# 02b199f1 13-Jun-2004 Max Laier <mlaier@FreeBSD.org>

Link ALTQ to the build and break with ABI for struct ifnet. Please recompile
your (network) modules as well as any userland that might make sense of
sizeof(struct ifnet).
This does not change the queueing yet. These changes will follow in a
seperate commit. Same with the driver changes, which need case by case
evaluation.

__FreeBSD_version bump will follow.

Tested-by: (i386)LINT


# 127d7b2d 03-May-2004 Andre Oppermann <andre@FreeBSD.org>

Link state change notification of ethernet media to the routing socket.

o Extend the if_data structure with an ifi_link_state field and
provide the corresponding defines for the valid states.

o The mii_linkchg() callback updates the ifi_link_state field
and calls rt_ifmsg() to notify listeners on the routing socket
in addition to the kqueue KNOTE.

o If vlans are configured on a physical interface notify and update
all vlan pseudo devices as well with the vlan_link_state() callback.

No objections by: sam, wpaul, ru, bms
Brucification by: bde


# 8614fb12 18-Apr-2004 Max Laier <mlaier@FreeBSD.org>

Make if_(un)route static in if.c as they are called from if_up/if_down only.
This is also cleanup to make locking easier.

Reviewed by: luigi
Approved by: bms(mentor)


# 212b6d52 17-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

+ rename and document an unused field in struct arpcom (field is still
there so there are no ABI changes);
+ replace 5 redefinitions of the IPF2AC macro with one in if_arp.h

Eventually (but before freezing the ABI) we need to get rid of
struct arpcom (initially with the help of some smart #defines
to avoid having to touch each and every driver, see below).

Apart from the struct ifnet, struct arpcom now only stores a copy
of the MAC address (ac_enaddr, but we already have another copy in
the struct ifnet -- if_addrhead), and a netgraph-specific field
which is _always_ accessed through the ifp, so it might well go
into the struct ifnet too (where, besides, there is already an entry
for AF_NETGRAPH data...)

Too bad ac_enaddr is widely referenced by all drivers. But
this can be fixed as follows:

#define ac_enaddr ac_if.the_original_ac_enaddr_in_struct_ifnet

(note that the right hand side would likely be a pointer rather than
the base address of an array.)


# d65d2351 16-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

Documented the intended usage of if_addrhead and ifaddr_byindex()
This commit only changes comments. Nothing to recompile.


# 621b79c4 15-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

Document the way if_addrhead and struct ifaddr are used.
Remove a member from 'struct ifaddr' which has been in an
#ifdef notdef block since rev 1.1

No ABI changes -- no need to recompile anything.


# 307c58e2 12-Apr-2004 Ruslan Ermilov <ru@FreeBSD.org>

Count outgoing link-level broadcast packets in if_omcasts.
I'm not sure this is completely correct but at least this
is consistent with the accounting of incoming broadcasts.

PR: kern/65273
Submitted by: David J Duchscher <daved@tamu.edu>


# 41a76b48 11-Apr-2004 Robert Watson <rwatson@FreeBSD.org>

In 4.x, if_ipending is used to track network interrupt state. In 5.x,
it is no longer used, so GC the ifnet.if_ipending field.


# f36cfd49 07-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# f7c5baa1 03-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

+ arpresolve(): remove an unused argument
+ struct ifnet: remove unused fields, move ipv6-related field close
to each other, add a pointer to l3<->l2 translation tables (arp,nd6,
etc.) for future use.

+ struct route: remove an unused field, move close to each
other some fields that might likely go away in the future


# 196f7f54 12-Mar-2004 Brooks Davis <brooks@FreeBSD.org>

Remove if_withname. It came in with the KAME import, but never got
used. Should someone need its functionality, it's a really expensive
implementation of:
ifnet_byindex(sdl->sdl_index)

Reviewed by: bde, ume


# 25a4adce 25-Feb-2004 Max Laier <mlaier@FreeBSD.org>

Bring eventhandler callbacks for pf.
This enables pf to track dynamic address changes on interfaces (dailup) with
the "on (<ifname>)"-syntax. This also brings hooks in anticipation of
tracking cloned interfaces, which will be in future versions of pf.

Approved by: bms(mentor)


# 2ab763ec 06-Dec-2003 Warner Losh <imp@FreeBSD.org>

Make the if_broadcastaddr const. All the drivers in the tree which
violated the constness were corrected before the freeze. This was
suggested by mdodd@, I think, and sam@ and others have signed off on
this if I recall my conversations with them correctly.


# eca8a663 11-Nov-2003 Robert Watson <rwatson@FreeBSD.org>

Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.

Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 9bf40ede 31-Oct-2003 Brooks Davis <brooks@FreeBSD.org>

Replace the if_name and if_unit members of struct ifnet with new members
if_xname, if_dname, and if_dunit. if_xname is the name of the interface
and if_dname/unit are the driver name and instance.

This change paves the way for interface renaming and enhanced pseudo
device creation and configuration symantics.

Approved By: re (in principle)
Reviewed By: njl, imp
Tested On: i386, amd64, sparc64
Obtained From: NetBSD (if_xname)


# 234a35c7 24-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

Since dp->dom_ifattach calls malloc() with M_WAITOK, we cannot
use mutex lock directly here. Protect ifp->if_afdata instead.

Reported by: grehan


# 31b1bfe1 17-Oct-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- add dom_if{attach,detach} framework.
- transition to use ifp->if_afdata.

Obtained from: KAME


# 9d5abbdd 01-Jan-2003 Jens Schweikhardt <schweikh@FreeBSD.org>

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


# d64ada50 30-Dec-2002 Jens Schweikhardt <schweikh@FreeBSD.org>

Fix typos, mostly s/ an / a / where appropriate and a few s/an/and/
Add FreeBSD Id tag where missing.


# c919b1e2 26-Dec-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Long chain of calls starting with bridge_on(), going through IPv6, and
ending up at ifa_ifwithdstaddr() could lead to a recursive lock of
the ifnet list mutex.


# b30a244c 21-Dec-2002 Jeffrey Hsu <hsu@FreeBSD.org>

SMP locking for ifnet list.


# 68eec1f8 17-Dec-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Switch to the conventional reference counting scheme.


# 19fc74fb 18-Dec-2002 Jeffrey Hsu <hsu@FreeBSD.org>

Lock up ifaddr reference counts.


# 76cfd300 14-Nov-2002 Sam Leffler <sam@FreeBSD.org>

o add if_nvlans member to track the number of vlans active on an interface
o add if_input member for interface drivers to call through to pass packets "up"
o remove ethernet-specific function decls (moved to ethernet.h)

Reviewed by: many
Approved by: re


# a7f62430 28-Sep-2002 Bruce Evans <bde@FreeBSD.org>

Fixed some of the namespace pollution in rev.1.33. <sys/systm.h> was
included here because it was once a prerequisite of <sys/mutex.h>
although that bug was fixed long ago.


# fa882e87 24-Sep-2002 Brooks Davis <brooks@FreeBSD.org>

Add a new helper function if_printf() modeled on device_printf(). The
function takes a struct ifnet pointer followed by the usual printf
arguments and prints "<interfacename>: " before the results of printf.
Since this is the primary form of printf calls in network device drivers
and accounts for most uses of the ifnet menber if_unit, this
significantly simplifies many printf()s.


# 62f76486 18-Aug-2002 Maxim Sobolev <sobomax@FreeBSD.org>

Increase size of ifnet.if_flags from 16 bits (short) to 32 bits (int). To avoid
breaking application ABI use unused ifreq.ifru_flags[1] for upper 16 bits in
SIOCSIFFLAGS and SIOCGIFFLAGS ioctl's.

Reviewed by: -hackers, -net


# c44d8405 13-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Move to nested include of _label.h instead of mac.h, reducing namespace
pollution.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
Suggested by: bde


# 19930ae5 30-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

Label network interface structures, permitting security features to
be maintained on those objects. if_label will be used to authorize
data flow using the network interface. if_label will be protected
using the same synchronization primitives as other mutable entries
in struct ifnet.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# f0a8d5cb 07-May-2002 Warner Losh <imp@FreeBSD.org>

Minor style nit


# 929ddbbb 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# effa274e 14-Dec-2001 Jonathan Lemon <jlemon@FreeBSD.org>

whitespace fixes.


# e4fc250c 14-Dec-2001 Luigi Rizzo <luigi@FreeBSD.org>

Device Polling code for -current.

Non-SMP, i386-only, no polling in the idle loop at the moment.

To use this code you must compile a kernel with

options DEVICE_POLLING

and at runtime enable polling with

sysctl kern.polling.enable=1

The percentage of CPU reserved to userland can be set with

sysctl kern.polling.user_frac=NN (default is 50)

while the remainder is used by polling device drivers and netisr's.
These are the only two variables that you should need to touch. There
are a few more parameters in kern.polling but the default values
are adequate for all purposes. See the code in kern_poll.c for
more details on them.

Polling in the idle loop will be implemented shortly by introducing
a kernel thread which does the job. Until then, the amount of CPU
dedicated to polling will never exceed (100-user_frac).
The equivalent (actually, better) code for -stable is at

http://info.iet.unipi.it/~luigi/polling/

and also supports polling in the idle loop.

NOTE to Alpha developers:
There is really nothing in this code that is i386-specific.
If you move the 2 lines supporting the new option from
sys/conf/{files,options}.i386 to sys/conf/{files,options} I am
pretty sure that this should work on the Alpha as well, just that
I do not have a suitable test box to try it. If someone feels like
trying it, I would appreciate it.

NOTE to other developers:
sure some things could be done better, and as always I am open to
constructive criticism, which a few of you have already given and
I greatly appreciated.
However, before proposing radical architectural changes, please
take some time to possibly try out this code, or at the very least
read the comments in kern_poll.c, especially re. the reason why I
am using a soft netisr and cannot (I believe) replace it with a
simple timeout.

Quick description of files touched by this commit:

sys/conf/files.i386
new file kern/kern_poll.c
sys/conf/options.i386
new option
sys/i386/i386/trap.c
poll in trap (disabled by default)
sys/kern/kern_clock.c
initialization and hardclock hooks.
sys/kern/kern_intr.c
minor swi_net changes
sys/kern/kern_poll.c
the bulk of the code.
sys/net/if.h
new flag
sys/net/if_var.h
declaration for functions used in device drivers.
sys/net/netisr.h
NETISR_POLL
sys/dev/fxp/if_fxp.c
sys/dev/fxp/if_fxpvar.h
sys/pci/if_dc.c
sys/pci/if_dcreg.h
sys/pci/if_sis.c
sys/pci/if_sisreg.h
device driver modifications


# 985fbf6b 22-Nov-2001 Luigi Rizzo <luigi@FreeBSD.org>

Expand the comment on the layout of softc, arpcom and ifnet structures,
and list the places where the assumption is used.


# 99efe4f0 14-Nov-2001 John Baldwin <jhb@FreeBSD.org>

Remove ifnet.if_mpsafe for now. If this is needed, it won't be needed
until much later when the network stack locking is farther along.

Approved by: jlemon


# 8071913d 17-Oct-2001 Ruslan Ermilov <ru@FreeBSD.org>

Pull post-4.4BSD change to sys/net/route.c from BSD/OS 4.2.

Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *''
as the argument. Pass rt_addrinfo all the way down to rtrequest1
and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now
``rt_addrinfo *'' instead of ``sockaddr *'' (almost noone is
using it anyways).

Benefit: the following command now works. Previously we needed
two route(8) invocations, "add" then "change".
# route add -inet6 default ::1 -ifp gif0

Remove unsafe typecast in rtrequest(), from ``rtentry *'' to
``sockaddr *''. It was introduced by 4.3BSD-Reno and never
corrected.

Obtained from: BSD/OS, NetBSD
MFC after: 1 month
PR: kern/28360


# 322dcb8d 14-Oct-2001 Max Khon <fjoe@FreeBSD.org>

bring in ARP support for variable length link level addresses

Reviewed by: jdp
Approved by: jdp
Obtained from: NetBSD
MFC after: 6 weeks


# 8ec410b5 02-Oct-2001 Matt Jacob <mjacob@FreeBSD.org>

Documentation comment: note that the each NIC's softc is assumed to start
with an ifnet structure.

MFC after: 1 week


# 016da741 18-Sep-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Add two fields to the ifnet structure indicating what extra capabilities
a network device has, and which ones are enabled.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# f9132ceb 05-Sep-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Wrap array accesses in macros, which also happen to be lvalues:

ifnet_addrs[i - 1] -> ifaddr_byindex(i)
ifindex2ifnet[i] -> ifnet_byindex(i)

This is intended to ease the conversion to SMPng.


# 30aad87d 02-Jul-2001 Brooks Davis <brooks@FreeBSD.org>

Add kernel infrastructure for network device cloning.

Reviewed by: ru, ume
Obtained from: NetBSD
MFC after: 1 week


# f34fa851 28-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Catch up to header include changes:
- <sys/mutex.h> now requires <sys/systm.h>
- <sys/mutex.h> and <sys/sx.h> now require <sys/lock.h>


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 6817526d 06-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Convert if_multiaddrs from LIST to TAILQ so that it can be traversed
backwards in the three drivers which want to do that.

Reviewed by: mikeh


# 90d9802f 29-Jan-2001 Peter Wemm <peter@FreeBSD.org>

Make the number of loopback interfaces dynamically tunable. Why one
would *want* to is a different story, but it used to be able to be done
statically. Get rid of #include "loop.h" and struct ifnet loif[NLOOP];
This could be used as an example of how to do this in other drivers,
for example: ccd.


# 9ba20c31 26-Nov-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Unbreak world; #include <sys/mutex.h> instead of <machine/mutex.h>
Only include <sys/mbuf.h> when building kernel sources. This should
probably be changed to require callers to include it themselves.


# df5e1987 25-Nov-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Lock down the network interface queues. The queue mutex must be obtained
before adding/removing packets from the queue. Also, the if_obytes and
if_omcasts fields should only be manipulated under protection of the mutex.

IF_ENQUEUE, IF_PREPEND, and IF_DEQUEUE perform all necessary locking on
the queue. An IF_LOCK macro is provided, as well as the old (mutex-less)
versions of the macros in the form _IF_ENQUEUE, _IF_QFULL, for code which
needs them, but their use is discouraged.

Two new macros are introduced: IF_DRAIN() to drain a queue, and IF_HANDOFF,
which takes care of locking/enqueue, and also statistics updating/start
if necessary.


# 5da9f8fa 19-Oct-2000 Josef Karthauser <joe@FreeBSD.org>

Augment the 'ifaddr' structure with a 'struct if_data' to keep
statistics on a per network address basis.

Teach the IPv4 and IPv6 input/output routines to log packets/bytes
against the network address connected to the flow.

Teach netstat to display the per-address stats for IP protocols
when 'netstat -i' is evoked, instead of displaying the per-interface
stats.


# 66ce51ce 14-Aug-2000 Archie Cobbs <archie@FreeBSD.org>

Export the functionality of SIOCSIFLLADDR with if_setlladdr()
and add some more rigorous sanity checking in the process.

Reviewed by: freebsd-net


# 21b8ebd9 13-Jul-2000 Archie Cobbs <archie@FreeBSD.org>

Make all Ethernet drivers attach using ether_ifattach() and detach using
ether_ifdetach().

The former consolidates the operations of if_attach(), ng_ether_attach(),
and bpfattach(). The latter consolidates the corresponding detach operations.

Reviewed by: julian, freebsd-net


# 6ec86086 29-Jun-2000 Archie Cobbs <archie@FreeBSD.org>

Fix kernel build breakage when 'device ether' was not included.


# e1e1452d 26-Jun-2000 Archie Cobbs <archie@FreeBSD.org>

Make the ng_ether(4) node type dynamically loadable like the rest.
This means 'options NETGRAPH' is no longer necessary in order to get
netgraph-enabled Ethernet interfaces. This supports loading/unloading
the ng_ether.ko and attaching/detaching the Ethernet interface in any
order.

Add two new hooks 'upper' and 'lower' to allow access to the protocol
demux engine and the raw device, respectively. This enables bridging
to be defined as a netgraph node, if so desired.

Reviewed by: freebsd-net@freebsd.org


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 06a429a3 24-May-2000 Archie Cobbs <archie@FreeBSD.org>

Just need to pass the address family to if_simloop(), not the whole sockaddr.


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# db4f9cc7 27-Mar-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Add support for offloading IP/TCP/UDP checksums to NIC hardware which
supports them.


# 664a31e4 28-Dec-1999 Peter Wemm <peter@FreeBSD.org>

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.


# 82cd038d 21-Nov-1999 Yoshinobu Inoue <shin@FreeBSD.org>

KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 76429de4 05-Nov-1999 Yoshinobu Inoue <shin@FreeBSD.org>

KAME related header files additions and merges.
(only those which don't affect c source files so much)

Reviewed by: cvs-committers
Obtained from: KAME project


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# aab3beee 06-Aug-1999 Brian Somers <brian@FreeBSD.org>

Define IF_MAXMTU and IF_MINMTU and don't allow MTUs with
out-of-range values.

``comparison is always 0'' warnings are silly !

Ok'd by: wollman, dg
Advised by: bde


# e3c1388b 16-May-1999 Pierre Beyssac <pb@FreeBSD.org>

PR: kern/10570
Submitted by: adrian@freebsd.org

Change reference count in struct ifaddr to a u_int, to be able
to handle more than 2^16 routes to the same interface.

Fix suggested by Andrew Bangs <andrewb@demon.net> in PR kern/10570.
Tested by <adrian@freebsd.org> and me under -current.


# dfd5dee1 06-May-1999 Peter Wemm <peter@FreeBSD.org>

Add sufficient braces to keep egcs happy about potentially ambiguous
if/else nesting.


# 6182fdbd 16-Apr-1999 Peter Wemm <peter@FreeBSD.org>

Bring the 'new-bus' to the i386. This extensively changes the way the
i386 platform boots, it is no longer ISA-centric, and is fully dynamic.
Most old drivers compile and run without modification via 'compatability
shims' to enable a smoother transition. eisa, isapnp and pccard* are
not yet using the new resource manager. Once fully converted, all drivers
will be loadable, including PCI and ISA.

(Some other changes appear to have snuck in, including a port of Soren's
ATA driver to the Alpha. Soren, back this out if you need to.)

This is a checkpoint of work-in-progress, but is quite functional.

The bulk of the work was done over the last few years by Doug Rabson and
Garrett Wollman.

Approved by: core


# e8c2601d 16-Dec-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Generalize the if_up() and if_down() functions under the names
if_route() and if_unroute().

This is first step towards sanitizing IFF_UP and IFF_RUNNING


# ed7509ac 11-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Go through the loopback code with a broom..
Remove lots'o'hacks.
looutput is now static.

Other callers who want to use loopback to allow shortcutting
should call the special entrypoint for this, if_simloop(), which is
specifically designed for this purpose. Using looutput for this purpose
was problematic, particularly with bpf and trying to keep track
of whether one should be using the charateristics of the loopback interface
or the interface (e.g. if_ethersubr.c) that was requesting the loopback.
There was a whole class of errors due to this mis-use each of which had
hacks to cover them up.

Consists largly of hack removal :-)


# ecbb00a2 07-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# c1087c13 15-Apr-1998 Bruce Evans <bde@FreeBSD.org>

Support compiling with `gcc -ansi'.


# 7ed8f465 27-Aug-1997 Julian Elischer <julian@FreeBSD.org>

Add a per-interface-address pointer to a function that can be supplied
by a protocol, to detirmine if an address matches the net this address
is part of. This is needed by protocols for which netmasks
"just don't work", for example appletalk.

Also add the code in appletalk to make use of this new feature.
Thsi fixes one of the longest standing bugs in appletalk.
The inability to talk to machines to which the path is via a router
which is on a different net, but the same netrange, as your interface.
Protocols that do not supply this function (e.g. IP) should not be affected.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 373f88ed 08-Jan-1997 Garrett Wollman <wollman@FreeBSD.org>

Fix a few oversights in the new multicast membership interface.


# 1158dfb7 07-Jan-1997 Garrett Wollman <wollman@FreeBSD.org>

Checkpoint the beginnings of the new kernel interface for
multicast group memberships. This is not actually operative
at the moment (a lot of other code still needs to be changed), but
this seemed like a useful reference point to check in so that
others (i.e. Bill Fenner) have fair warning of where we are going.


# 19ff91c6 03-Jan-1997 Garrett Wollman <wollman@FreeBSD.org>

Separate kernel-internal data structures from exposed user interface
to interfaces. (Amazing nobody had done this!)

More commits to fix up user-land to follow.