History log of /freebsd-current/sys/net/if_lagg.c
Revision Date Author Comments
# fadbb6f8 06-May-2024 Gleb Smirnoff <glebius@FreeBSD.org>

lagg: remove use of net epoch in the ioctl paths

Rely on LAGG_SLOCK() instead. The use of network epoch(9) here was added
in 6573d7580b851 (later tidied by 87bf9b9cbeebc) as a large sweep that
blindly substituted blocking kernel primitives with epoch(9). In these
particular code paths use of epoch(9) is incorrect and doesn't provide any
protection against a stale pointer. Recent fix 48698ead6ff0, which should
actually have removed the epoch use, created a potential sleeping in epoch
problem.


# 57068597 06-May-2024 Gleb Smirnoff <glebius@FreeBSD.org>

lagg: propagate up/down to the children

Based on the old submission from asomers@. With the modern state of locking
in lagg(4), the patch became much simpler. Enable the test that was
waiting for this change.

PR: 226144
Reviewed by: asomers
Differential Revision: https://reviews.freebsd.org/D44605


# 48698ead 23-Feb-2024 Gleb Smirnoff <glebius@FreeBSD.org>

lagg: wrap lagg_port2req() into LAGG_SLOCK()

Although a port addition is coded in a sequence where first all softc
information is filled in and only then it is attached to the lagg, we
still need a locking primitive to guarantee cache invalidation. A panic
observed in the wild shows that lacp_portreq() called via
lagg_port_ioctl(SIOCGLAGGPORT) immediately after port creation may see
lp->lp_psc as NULL and panic. In the core file we will see valid data
all around. A race via lagg_ioctl() wasn't observed but is potentially
possible.

Differential Revision: https://reviews.freebsd.org/D43501


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 401f0344 17-Apr-2023 Zhenlei Huang <zlei@FreeBSD.org>

lagg(4): Correctly define some sysctl variables

939a050ad96c virtualized lagg(4), but the corresponding sysctls of some
virtualized global variables are not marked with CTLFLAG_VNET. An attempt to
operate on those variables via sysctl will effectively go to the 'master'
copies and the virtualized ones are not read or set accordingly. As a
side effect, on updating the 'master' copy, the virtualized global
variables of newly created vnets will have correct values.

PR: 270705
Reviewed by: kp
Fixes: 939a050ad96c Virtualize lagg(4) cloner
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D39467


# 5f3d0399 02-Apr-2023 Zhenlei Huang <zlei@FreeBSD.org>

lagg(4): Tap traffic after protocol processing

Different lagg protocols have different means and policies to process incoming
traffic. For example, for the failover protocol, by default received traffic is
only accepted when it is received through the active port. For the lacp protocol,
LACP control messages are tapped off, and traffic will also be dropped if it is
received through a port which is not in collecting state or is not joined to
the active aggregator. It is confusing if a user dumps and sees inbound traffic
on lagg(4) interfaces that is actually silently dropped and not passed into
the net stack.

Tap traffic after protocol processing so that the user will have a consistent
view of the inbound traffic; meanwhile the mbuf is set with the correct receiving
interface and bpf(4) will diagnose the right direction of inbound packets.
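
The ordering described above can be illustrated with a small user-space
sketch. Everything here is hypothetical (struct pkt, proto_input(),
lagg_input_sketch() and the tapped counter are stand-ins, not the kernel's
mbuf, lagg or bpf interfaces); it only shows that a packet reaches the tap
after the protocol has accepted it and after its receiving interface has
been rewritten.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a received packet; not the kernel's struct mbuf. */
struct pkt {
	bool is_control;	/* e.g. an LACP control frame */
	bool port_active;	/* arrived on a port the protocol accepts */
	const char *rcvif;	/* name of the receiving interface */
};

static int tapped;	/* number of packets shown to the capture tap */

/* Protocol step: returns NULL when the packet is consumed or dropped. */
static struct pkt *
proto_input(struct pkt *p)
{
	if (p->is_control || !p->port_active)
		return (NULL);
	return (p);
}

/* Tap only what survives protocol processing, after fixing up rcvif. */
static bool
lagg_input_sketch(struct pkt *p, const char *lagg_name)
{
	p = proto_input(p);
	if (p == NULL)
		return (false);		/* dropped: never seen by the tap */
	p->rcvif = lagg_name;		/* set the receiving interface first */
	tapped++;			/* then hand the packet to the tap */
	return (true);
}
```

With this ordering a capture on the aggregate shows only packets that are
actually passed up the stack.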

PR: 270417
Reviewed by: melifaro (previous version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39225


# 90820ef1 02-Apr-2023 Zhenlei Huang <zlei@FreeBSD.org>

infiniband: Widen NET_EPOCH coverage

From static code analysis, some device drivers (cxgbe, mlx4, mthca, and qlnx)
do not enter net epoch before lagg_input_infiniband(). If an IPoIB interface is a
member of a lagg(4) interface, and after returning from lagg_input_infiniband()
the receiving interface of the mbuf is set to the lagg(4) interface, then when
concurrently destroying the lagg(4) interface, there is a small window in which
the interface gets destroyed and becomes invalid before infiniband_input()
re-enters net epoch, thus leading to a use-after-free.

Widen NET_EPOCH coverage to prevent use-after-free.

Thanks hselasky@ for testing with mlx5 devices.

Reviewed by: hselasky
Tested by: hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39275


# 5a8abd0a 31-Mar-2023 Zhenlei Huang <zlei@FreeBSD.org>

lacp: Use C99 bool for boolean return value

This improves readability.

No functional change intended.

MFC after: 1 week


# d4a80d21 29-Mar-2023 Zhenlei Huang <zlei@FreeBSD.org>

lagg(4): Do not enter net epoch recursively

This saves a few resources.

No functional change intended.

Reviewed by: kp
Fixes: b8a6e03fac92 Widen NET_EPOCH coverage
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39267


# dbe86dd5 29-Mar-2023 Zhenlei Huang <zlei@FreeBSD.org>

lagg(4): Refactor out some lagg protocol input routines into a default one

Those input routines are identical.

Also inline two fast paths.

No functional change intended.

MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39251


# fcac5719 29-Mar-2023 Zhenlei Huang <zlei@FreeBSD.org>

lagg(4): Make lagg_list and lagg_detach_cookie static

They are used internally only.

No functional change intended.

MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39250


# dcd7f0bd 24-Mar-2023 Zhenlei Huang <zlei@FreeBSD.org>

lagg: Various style fixes

MFC after: 1 week


# adf62e83 08-Feb-2023 Justin Hibbits <jhibbits@FreeBSD.org>

infiniband: Convert BPF handling for IfAPI

Summary:
All callers of infiniband_bpf_mtap() call it through the wrapper macro,
which checks the if_bpf member explicitly. Since this is getting
hidden, move this check into the internal function and remove the
wrapper macro.

Reviewed by: hselasky
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D39024


# 66bdbcd5 03-Mar-2023 Alexander V. Chernikov <melifaro@FreeBSD.org>

net: unify mtu update code

Subscribers: imp, ae, glebius

Differential Revision: https://reviews.freebsd.org/D38893


# 2c2b37ad 13-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

ifnet/API: Move struct ifnet definition to a <net/if_private.h>

Hide the ifnet structure definition, no user serviceable parts inside,
it's a netstack implementation detail. Include it temporarily in
<net/if_var.h> until all drivers are updated to use the accessors
exclusively.

Reviewed by: glebius
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38046


# 110ce09c 13-Jan-2023 Tom Jones <thj@FreeBSD.org>

if_lagg: Allow lagg interfaces to be used with netmap

Reviewed by: zlei
Sponsored by: Zenarmor
Sponsored by: OPNsense
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37436


# 91ebcbe0 21-Sep-2022 Alexander V. Chernikov <melifaro@FreeBSD.org>

if_clone: migrate some consumers to the new KPI.

Convert most of the cloner consumers that require custom params
to the new if_clone KPI.

Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D36636
MFC after: 2 weeks


# 713ceb99 28-Jul-2022 Andrew Gallatin <gallatin@FreeBSD.org>

lagg: fix lagg ifioctl after SIOCSIFCAPNV

Lagg was broken by SIOCSIFCAPNV when all underlying devices
support SIOCSIFCAPNV. This change updates lagg to work with
SIOCSIFCAPNV and if_capabilities2.

Reviewed by: kib, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35865


# fa267a32 21-Jul-2022 Dimitry Andric <dim@FreeBSD.org>

Fix unused variable warning in if_lagg.c

With clang 15, the following -Werror warning is produced:

sys/net/if_lagg.c:2413:6: error: variable 'active_ports' set but not used [-Werror,-Wunused-but-set-variable]
int active_ports = 0;
^

The 'active_ports' variable appears to have been a debugging aid that
has never been used for anything (ref https://reviews.freebsd.org/D549),
so remove it.

MFC after: 3 days


# 1967e313 24-May-2022 Hans Petter Selasky <hselasky@FreeBSD.org>

lagg(4): Add support for allocating TLS receive tags.

The TLS receive tags are allocated directly from the receiving interface,
because mbufs are flowing in the opposite direction and then route change
checks are not useful, because they only work for outgoing traffic.

Differential revision: https://reviews.freebsd.org/D32356
Sponsored by: NVIDIA Networking


# 3142d4f6 19-Nov-2021 Kristof Provost <kp@FreeBSD.org>

lagg: fix unused-but-set-variable

MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")


# acdfc096 06-Nov-2021 Wojciech Macek <wma@FreeBSD.org>

lagg: update capabilities on SIOCSIFMTU

Some NICs might have limited capabilities when Jumbo frames are used.
For example, some neta interfaces only support TX csum offload when the
packet size is lower than a value specified in DT.
Fix it by re-reading capabilities of children interfaces after MTU
has been successfully changed.

Found by: Jerome Tomczyk <jerome.tomczyk@stormshield.eu>
Reviewed by: jhb
Obtained from: Semihalf
Sponsored by: Stormshield
Differential revision: https://reviews.freebsd.org/D32724


# c782ea8b 14-Sep-2021 John Baldwin <jhb@FreeBSD.org>

Add a switch structure for send tags.

Move the type and function pointers for operations on existing send
tags (modify, query, next, free) out of 'struct ifnet' and into a new
'struct if_snd_tag_sw'. A pointer to this structure is added to the
generic part of send tags and is initialized by m_snd_tag_init()
(which now accepts a switch structure as a new argument in place of
the type).

Previously, device driver ifnet methods switched on the type to call
type-specific functions. Now, those type-specific functions are saved
in the switch structure and invoked directly. In addition, this more
gracefully permits multiple implementations of the same tag within a
driver. In particular, NIC TLS for future Chelsio adapters will use a
different implementation than the existing NIC TLS support for T6
adapters.
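
The idea of a per-tag switch structure can be sketched in user space as
follows. All names (struct stag, struct stag_sw, stag_init(), stag_query())
are hypothetical and much simpler than the kernel's send tag types; the
sketch only shows dispatch through a table of function pointers stored in
the tag, and two implementations sharing one tag type.

```c
struct stag;

/* Hypothetical per-implementation operations table ("switch structure"). */
struct stag_sw {
	int type;
	int (*query)(const struct stag *);
};

/* Generic part of a tag: carries a pointer to its operations table. */
struct stag {
	const struct stag_sw *sw;
	int value;
};

/* Initialization takes the table rather than a bare type. */
static void
stag_init(struct stag *t, const struct stag_sw *sw, int value)
{
	t->sw = sw;
	t->value = value;
}

static int
query_v1(const struct stag *t)
{
	return (t->value);
}

static int
query_v2(const struct stag *t)
{
	return (t->value * 2);
}

/* Two implementations of the same tag type coexisting in one driver. */
static const struct stag_sw sw_v1 = { .type = 1, .query = query_v1 };
static const struct stag_sw sw_v2 = { .type = 1, .query = query_v2 };

/* Callers dispatch through the table; there is no switch on the type. */
static int
stag_query(const struct stag *t)
{
	return (t->sw->query(t));
}
```

Because the function pointer travels with the tag, adding a second
implementation does not require touching a central switch statement.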

Reviewed by: gallatin, hselasky, kib (older version)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31572


# c1384241 17-Aug-2021 Luiz Otavio O Souza <loos@FreeBSD.org>

lagg: don't update link layer addresses on destroy

When the lagg is being destroyed it is not necessary to update the
lladdr of all the lagg members every time we update the primary
interface.

Reviewed by: scottl
Obtained from: pfSense
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31586


# 1a714ff2 26-Jan-2021 Randall Stewart <rrs@FreeBSD.org>

This pulls over all the changes that are in the netflix
tree that fix the ratelimit code. There were several bugs
in tcp_ratelimit itself and we needed further work to support
the multiple tag format coming for the joint TLS and Ratelimit dances.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D28357


# 19ecb5e8 29-Dec-2020 Hans Petter Selasky <hselasky@FreeBSD.org>

Fix for IPoIB over lagg(4).

Need to update both link layer address and broadcast address when active link changes for IP over infiniband.
This is because the broadcast address contains the so-called P-key, which is interface dependent.

Reviewed by: kib@
Differential Revision: https://reviews.freebsd.org/D27658
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking


# 5ee33a90 08-Dec-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Fixup r368446 with KERN_TLS.


# e1074ed6 08-Dec-2020 Gleb Smirnoff <glebius@FreeBSD.org>

The list of ports in configuration path shall be protected by locks,
epoch shall be used only for fast path. Thus use LAGG_XLOCK() in
lagg_[un]register_vlan. This fixes sleeping in epoch panic.

PR: 240609


# 87bf9b9c 08-Dec-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Convert LAGG_RLOCK() to NET_EPOCH_ENTER(). No functional changes.


# 8732245d 18-Nov-2020 Andrew Gallatin <gallatin@FreeBSD.org>

LACP: When suppressing distributing, return ENOBUFS

When links come and go, lacp goes into a "suppress distributing" mode
where it drops traffic for 3 seconds. When in this mode, lagg/lacp
historically drops traffic with ENETDOWN. That return value causes TCP
to close any connection where it gets that value back from the lower
parts of the stack. This means that any TCP connection with active
traffic during a 3-second window when an LACP link comes or goes
would get closed.

TCP treats return values of ENOBUFS as transient errors, and re-schedules
transmission later. So rather than returning ENETDOWN, let's
return ENOBUFS instead. This allows TCP connections to be preserved.

I've tested this by repeatedly bouncing links on a Netflix CDN server
under a moderate (20Gb/s) load and observed ENOBUFS reported back to
the TCP stack (as reported by a RACK TCP sysctl).
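
As a rough illustration of why the error code matters, the sketch below
models an upper layer that retries on ENOBUFS and gives up on anything else.
Both functions are hypothetical simplifications; the real TCP error handling
is more involved, and only the errno values themselves are taken from the
text above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical upper-layer policy: ENOBUFS is retried, other errors are not. */
static bool
tx_error_is_transient(int error)
{
	return (error == ENOBUFS);
}

/*
 * Hypothetical lower layer: the error reported while distribution is
 * suppressed, before (ENETDOWN) and after (ENOBUFS) the change.
 */
static int
suppressed_tx_error(bool after_change)
{
	return (after_change ? ENOBUFS : ENETDOWN);
}
```

Under this model a connection sending during the suppression window is
retried after the change, where before it would have been torn down.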

Reviewed by: jhb, jtl, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27188


# 2f4ffa9f 11-Nov-2020 Andrey V. Elsukov <ae@FreeBSD.org>

Fix possible NULL pointer dereference.

lagg(4) replaces if_output method of its child interfaces and expects
that this method can be called only by child interfaces. But it is
possible that lagg_port_output() could be called by children of child
interfaces. In this case ifnet's if_lagg field is NULL. Add check that
lp is not NULL.
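
A minimal user-space sketch of the guard, with hypothetical types standing
in for struct ifnet and struct lagg_port. The ENETDOWN return for the
unexpected caller is an assumption made for this example, not a statement
about what the driver returns.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the port and interface structures. */
struct port_sketch {
	int id;
};

struct ifnet_sketch {
	struct port_sketch *if_lagg;	/* NULL when not a lagg member */
};

/* Check the back-pointer before use instead of assuming a lagg member. */
static int
port_output_sketch(const struct ifnet_sketch *ifp)
{
	const struct port_sketch *lp = ifp->if_lagg;

	if (lp == NULL)
		return (ENETDOWN);	/* assumed error code: drop, do not dereference */
	return (0);			/* would forward via the port here */
}
```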

Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC


# 36e0a362 29-Oct-2020 John Baldwin <jhb@FreeBSD.org>

Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc().

This gives a more uniform API for send tag life cycle management.

Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27000


# a92c4bb6 22-Oct-2020 Hans Petter Selasky <hselasky@FreeBSD.org>

Add support for IP over infiniband, IPoIB, to lagg(4). Currently only
the failover protocol is supported due to limitations in the IPoIB
architecture. Refer to the lagg(4) manual page for how to configure
and use this new feature. A new network interface type,
IFT_INFINIBANDLAG, has been added, similar to the existing
IFT_IEEE8023ADLAG.

ifconfig(8) has been updated to accept a new laggtype argument when
creating lagg(4) network interfaces. This new argument is used to
distinguish between ethernet and infiniband type of lagg(4) network
interface. The laggtype argument is optional and defaults to
ethernet. The lagg(4) command line syntax is backwards compatible.

Differential Revision: https://reviews.freebsd.org/D26254
Reviewed by: melifaro@
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking


# 56fb710f 06-Oct-2020 John Baldwin <jhb@FreeBSD.org>

Store the send tag type in the common send tag header.

Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with
their own identical headers that stored the type that the
type-specific tag structures inherited from, so in practice it seems
drivers need this in the tag anyway. This permits removing these
extra header indirections (struct cxgbe_snd_tag and struct
mlx5e_snd_tag).

In addition, this permits driver-independent code to query the type of
a tag, e.g. to know what type of tag is being queried via
if_snd_query.

Reviewed by: gallatin, hselasky, np, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26689


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# 3869d414 13-Aug-2020 Bryan Drewery <bdrewery@FreeBSD.org>

lagg: Avoid adding a port to a lagg device being destroyed.

The lagg_clone_destroy() handles detach and waiting for ifconfig callers
to drain already.

This narrows the race for 2 panics that the tests triggered. Both were a
consequence of adding a port to the lagg device after it had already detached
from all of its ports. The link state task would run after lagg_clone_destroy()
freed the lagg softc.

kernel:trap_fatal+0xa4
kernel:trap_pfault+0x61
kernel:trap+0x316
kernel:witness_checkorder+0x6d
kernel:_sx_xlock+0x72
if_lagg.ko:lagg_port_state+0x3b
kernel:if_down+0x144
kernel:if_detach+0x659
if_tap.ko:tap_destroy+0x46
kernel:if_clone_destroyif+0x1b7
kernel:if_clone_destroy+0x8d
kernel:ifioctl+0x29c
kernel:kern_ioctl+0x2bd
kernel:sys_ioctl+0x16d
kernel:amd64_syscall+0x337

kernel:trap_fatal+0xa4
kernel:trap_pfault+0x61
kernel:trap+0x316
kernel:witness_checkorder+0x6d
kernel:_sx_xlock+0x72
if_lagg.ko:lagg_port_state+0x3b
kernel:do_link_state_change+0x9b
kernel:taskqueue_run_locked+0x10b
kernel:taskqueue_run+0x49
kernel:ithread_loop+0x19c
kernel:fork_exit+0x83

PR: 244168
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D25284


# 2a73c8f5 11-Jun-2020 Ravi Pokala <rpokala@FreeBSD.org>

Decode the "LACP Fast Timeout" LAGG option flag

r286700 added the "lacp_fast_timeout" option to `ifconfig', but we forgot to
include the new option in the string used to decode the option bits. Add
"LACP_FAST_TIMO" to LAGG_OPT_BITS.

Also, s/LAGG_OPT_LACP_TIMEOUT/LAGG_OPT_LACP_FAST_TIMO/g , to be clearer that
the flag indicates "Fast Timeout" mode.

Reported by: Greg Foster <gfoster at panasas dot com>
Reviewed by: jpaetzel
MFC after: 1 week
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D25239


# bd673b99 13-Apr-2020 Andrew Gallatin <gallatin@FreeBSD.org>

lagg: stop double-counting output errors and counting drops as errors

Before this change, lagg double-counted errors from lagg members, and counted
every drop by a lagg member as an error. Eg, if lagg sent a packet, and the
underlying hardware driver dropped it, a counter would be incremented by both
lagg and the underlying driver.

This change attempts to fix that by incrementing lagg's counters only for
errors that do not come from underlying drivers.

Reviewed by: hselasky, jhb
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24331


# 98085bae 09-Mar-2020 Andrew Gallatin <gallatin@FreeBSD.org>

make lacp's use_numa hashing aware of send tags

When I did the use_numa support, I missed the fact that there is
a separate hash function for send tag nic selection. So when
use_numa is enabled, ktls offload does not work properly, as it
does not reliably allocate a send tag on the proper egress nic
since different egress nics are selected for send-tag allocation
and packet transmit. To fix this, this change:

- refactors lacp_select_tx_port_by_hash() and
lacp_select_tx_port() to make lacp_select_tx_port_by_hash()
always called by lacp_select_tx_port()

- pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash()

- adds a numa_domain field to if_snd_tag_alloc_params

- plumbs the numa domain into places where we allocate send tags

In testing with NIC TLS setup on a NUMA machine, I see thousands
of output errors before the change when enabling
kern.ipc.tls.ifnet.permitted=1. After the change, I see no
errors, and I see the NIC sysctl counters showing active TLS
offload sessions.

Reviewed by: rrs, hselasky, jhb
Sponsored by: Netflix


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 84becee1 22-Jan-2020 Alexander Motin <mav@FreeBSD.org>

Update route MTUs for bridge, lagg and vlan interfaces.

Those interfaces may implicitly change their MTU on addition of parent
interface in addition to normal SIOCSIFMTU ioctl path, where the route
MTUs are updated normally.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# 2a4bd982 14-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Introduce NET_EPOCH_CALL() macro and use it everywhere where we free
data based on the network epoch. The macro reverses the argument
order of epoch_call(9) - first function, then its argument. NFC


# 97168be8 14-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically substitute assertion of in_epoch(net_epoch_preempt) to
NET_EPOCH_ASSERT(). NFC


# c23df8ea 09-Jan-2020 Mark Johnston <markj@FreeBSD.org>

lagg: Further cleanup of the rr_limit option.

Add an option flag so that arbitrary updates to a lagg's configuration
do not clear sc_stride. Preserve compatibility for old ifconfig
binaries. Update ifconfig to use the new flag and improve the casting
used when parsing the option parameter.

Modify the RR transmit function to avoid locklessly reading sc_stride
twice. Ensure that sc_stride is always 1 or greater.
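
The single-read pattern can be sketched with C11 atomics in user space. The
variable and helper names below are hypothetical and the kernel uses its own
primitives; the point is only that the value is loaded once into a local and
that a stored zero is never observed.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared configuration value, updated by a writer elsewhere. */
static _Atomic uint32_t stride_sketch = 1;

/* Writer side: never store 0, so readers can always divide by the value. */
static void
stride_set(uint32_t v)
{
	atomic_store_explicit(&stride_sketch, v == 0 ? 1 : v,
	    memory_order_relaxed);
}

/*
 * Reader side: load exactly once and use the local copy afterwards, so a
 * concurrent update cannot produce two different values within one call.
 */
static uint32_t
stride_snapshot(void)
{
	uint32_t s;

	s = atomic_load_explicit(&stride_sketch, memory_order_relaxed);
	return (s == 0 ? 1 : s);
}
```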

Reviewed by: hselasky
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23092


# c104c299 22-Dec-2019 Mark Johnston <markj@FreeBSD.org>

lagg: Clean up handling of the rr_limit option.

- Don't allow an unprivileged user to set the stride. [1]
- Only set the stride under the softc lock.
- Rename the internal fields to accurately reflect their use. Keep
ro_bkt to avoid changing the user API.
- Simplify the implementation. The port index is just sc_seq / stride.
- Document rr_limit in ifconfig.8.
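
The simplified index computation can be sketched as below. The function name
and the wrap-around by the number of ports are assumptions made for this
illustration; only "port index is sc_seq / stride" comes from the list above.

```c
#include <stdint.h>

/*
 * Hypothetical helper: with a stride of N, N consecutive sequence numbers
 * map to the same port before moving on to the next one. The caller must
 * pass nports > 0.
 */
static uint32_t
rr_port_index(uint32_t seq, uint32_t stride, uint32_t nports)
{
	if (stride == 0)
		stride = 1;	/* guard against division by zero */
	return ((seq / stride) % nports);
}
```

For example, with a stride of 2 and 3 ports, sequence numbers 0 and 1 use
port 0, 2 and 3 use port 1, 4 and 5 use port 2, and 6 wraps to port 0.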

Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com> [1]
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22857


# b2e60773 26-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the sendfile case,
sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277


# 20abea66 01-Aug-2019 Randall Stewart <rrs@FreeBSD.org>

This adds the third step in getting BBR into the tree. BBR and
an updated rack depend on having access to the new
ratelimit api in this commit.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D20953


# ce7fb386 06-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Restore the comment removed in r348745.

LAGG_RLOCK() enters an epoch section, so the comment wasn't stale.

Reported by: jhb
MFC with: r348745


# 9995dfd3 06-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Conditionalize an in_epoch() call on INVARIANTS.

Its result is only used to determine whether to perform further
INVARIANTS-only checks. Remove a stale comment while here.

Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after: 1 week


# fb3bc596 24-May-2019 John Baldwin <jhb@FreeBSD.org>

Restructure mbuf send tags to provide stronger guarantees.

- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.

Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.

To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.

For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.

- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code allocating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.

This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).

In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.

To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.

- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
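
The temporary reference described for ethofld_tx() can be sketched as below.
This is not the cxgbe code: tag_sketch, tag_hold, tag_rele and tx_loop are
hypothetical names, and a plain integer stands in for the kernel's atomic
reference count.

```c
#include <assert.h>

/* Hypothetical tag; the kernel uses an atomic refcount. */
struct tag_sketch {
	int refs;
	int freed;	/* set once the free routine has run */
	int in_loop;	/* set while the transmit loop is running */
};

static void
tag_free(struct tag_sketch *t)
{
	/* The free path must not run in the middle of the transmit loop. */
	assert(!t->in_loop);
	t->freed = 1;
}

static void
tag_hold(struct tag_sketch *t)
{
	t->refs++;
}

static void
tag_rele(struct tag_sketch *t)
{
	if (--t->refs == 0)
		tag_free(t);
}

/* Queue npkts packets, each of which holds one reference on the tag. */
static void
tx_loop(struct tag_sketch *t, int npkts)
{
	int i;

	tag_hold(t);		/* temporary reference defers the free */
	t->in_loop = 1;
	for (i = 0; i < npkts; i++)
		tag_rele(t);	/* drop the packet's reference when queued */
	t->in_loop = 0;
	tag_rele(t);		/* the last reference may go away here */
}
```

If the inp has already dropped its reference, the free happens at the final
tag_rele() rather than inside the loop.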

Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117


# 35961dce 03-May-2019 Andrew Gallatin <gallatin@FreeBSD.org>

Select lacp egress ports based on NUMA domain

This change creates an array of port maps indexed by numa domain
for lacp port selection. If we have lacp interfaces in more than
one domain, then we select the egress port by indexing into the
numa port maps and picking a port on the appropriate numa domain.

This behavior is controlled by the new ifconfig use_numa flag
and net.link.lagg.use_numa sysctl/tunable (both modeled after the
existing use_flowid), which default to enabled.
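
A rough sketch of the selection described above, under stated assumptions:
the structure and function names are hypothetical, the sizes are arbitrary,
and falling back to the global map when a domain has no ports is an
assumption of this sketch rather than something the message states.

```c
#include <assert.h>
#include <stdint.h>

#define SK_MAXDOM	4	/* hypothetical limits for the sketch */
#define SK_MAXPORTS	8

struct portmap_sketch {
	int count;			/* ports available in this map */
	int ports[SK_MAXPORTS];		/* port identifiers */
};

struct lacp_sketch {
	int use_numa;				/* models the use_numa flag */
	struct portmap_sketch all;		/* every active port */
	struct portmap_sketch bydom[SK_MAXDOM];	/* ports per NUMA domain */
};

/* Returns the chosen port id, or -1 if no port is available. */
static int
select_port(const struct lacp_sketch *sc, uint32_t hash, int domain)
{
	const struct portmap_sketch *pm = &sc->all;

	/* Prefer the domain-local map when enabled and non-empty. */
	if (sc->use_numa && domain >= 0 && domain < SK_MAXDOM &&
	    sc->bydom[domain].count > 0)
		pm = &sc->bydom[domain];
	if (pm->count == 0)
		return (-1);
	return (pm->ports[hash % (uint32_t)pm->count]);
}
```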

Reviewed by: bz, hselasky, markj (and scottl, earlier version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20060


# 841613dc 28-Mar-2019 John Baldwin <jhb@FreeBSD.org>

Use a dedicated malloc type for lagg(4)'s structures.

Reviewed by: gallatin
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19719


# 2f59b04a 28-Mar-2019 John Baldwin <jhb@FreeBSD.org>

Remove nested epochs from lagg(4).

lagg_bcast_start appeared to have a bug in that it was using the last
lagg port structure after exiting the epoch that was keeping that
structure alive. However, upon further inspection, the epoch was
already entered by the caller (lagg_transmit), so the epoch enter/exit
in lagg_bcast_start was actually unnecessary.

This commit generally removes uses of the net epoch via LAGG_RLOCK to
protect the list of ports when the list of ports was already protected
by an existing LAGG_RLOCK in a caller, or the LAGG_XLOCK.

It also adds a missing epoch enter/exit in lagg_snd_tag_alloc while
accessing the lagg port structures. An ifp is still accessed via an
unsafe reference after the epoch is exited, but that is true in the
current code and will be fixed in a future change.

Reviewed by: gallatin
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19718


# fa91f845 13-Feb-2019 Randall Stewart <rrs@FreeBSD.org>

This commit adds the missing release mechanism for the
ratelimiting code. The two modules (lagg and vlan) did have
allocation routines, and even though they are indirect (and
vector down to the underlying interfaces) they both need to
have a free routine (that also vectors down to the actual interface).

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D19032


# 829c56fc 31-Jan-2019 John Baldwin <jhb@FreeBSD.org>

Don't set IFCAP_TXRTLMT during lagg_clone_create().

lagg_capabilities() will set the capability once interfaces supporting
the feature are added to the lagg. Setting it on a lagg without any
interfaces is pointless as the if_snd_tag_alloc call will always fail
in that case.

Reviewed by: hselasky, gallatin
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19040


# fbd8c330 30-Oct-2018 Marcelo Araujo <araujo@FreeBSD.org>

Allow changing lagg(4) MTU.

Previously, changing the MTU would require destroying the lagg and
creating a new one. Now it is allowed to change the MTU of
the lagg interface and the MTU of the ports will be set to match.

If any port cannot set the new MTU, all ports are reverted to the original
MTU of the lagg. Additionally, when adding ports, the MTU of a port will be
automatically set to the MTU of the lagg. As always, the MTU of the lagg is
initially determined by the MTU of the first port added. If adding an
interface as a port for some reason fails, that interface is reverted to its
original MTU.
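
The all-or-nothing behaviour can be sketched as below. This is only an
illustration: port_sketch, port_set_mtu and lagg_set_mtu are hypothetical
names, and a per-port max_mtu field stands in for whatever reason a real
driver might reject the request.

```c
#include <assert.h>

struct port_sketch {
	int mtu;	/* current MTU */
	int max_mtu;	/* hypothetical hardware limit */
};

/* Returns 0 on success, -1 if the port rejects the MTU. */
static int
port_set_mtu(struct port_sketch *p, int mtu)
{
	if (mtu > p->max_mtu)
		return (-1);
	p->mtu = mtu;
	return (0);
}

/* Apply mtu to every port, or revert all ports to the lagg's old MTU. */
static int
lagg_set_mtu(struct port_sketch *ports, int nports, int *lagg_mtu, int mtu)
{
	int i, j;

	for (i = 0; i < nports; i++) {
		if (port_set_mtu(&ports[i], mtu) != 0) {
			for (j = 0; j < nports; j++)
				(void)port_set_mtu(&ports[j], *lagg_mtu);
			return (-1);
		}
	}
	*lagg_mtu = mtu;
	return (0);
}
```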

Submitted by: Ryan Moeller <ryan@freqlabs.com>
Reviewed by: mav
Relnotes: Yes
Sponsored by: iXsystems Inc.
Differential Revision: https://reviews.freebsd.org/D17576


# 13c6ba6d 09-Oct-2018 Jonathan T. Looney <jtl@FreeBSD.org>

There are three places where we return from a function which entered an
epoch section without exiting that epoch section. This is bad for two
reasons: the epoch section won't exit, and we will leave the epoch tracker
from the stack on the epoch list.

Fix the epoch leak by making sure we exit epoch sections before returning.
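
The shape of the bug and of the fix can be sketched as follows. The
sk_epoch_enter/sk_epoch_exit helpers are hypothetical stand-ins that only
count open sections; they are not the epoch(9) API.

```c
#include <assert.h>
#include <stddef.h>

static int sk_epoch_depth;	/* hypothetical: number of open sections */

static void sk_epoch_enter(void) { sk_epoch_depth++; }
static void sk_epoch_exit(void)  { sk_epoch_depth--; }

/* Buggy shape: the early return skips the exit. */
static int
lookup_leaky(const int *table, size_t n, int key)
{
	size_t i;

	sk_epoch_enter();
	for (i = 0; i < n; i++) {
		if (table[i] == key)
			return (1);	/* leaves the section open */
	}
	sk_epoch_exit();
	return (0);
}

/* Fixed shape: every path reaches the exit before returning. */
static int
lookup_fixed(const int *table, size_t n, int key)
{
	size_t i;
	int found = 0;

	sk_epoch_enter();
	for (i = 0; i < n; i++) {
		if (table[i] == key) {
			found = 1;
			break;
		}
	}
	sk_epoch_exit();
	return (found);
}
```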

Reviewed by: ae, gallatin, mmacy
Approved by: re (gjb, kib)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D17450


# 5ccac9f9 13-Aug-2018 Andrew Gallatin <gallatin@FreeBSD.org>

lagg: allow lacp to manage the link state

Lacp needs to manage the link state itself. Unlike other
lagg protocols, the ability of lacp to pass traffic
depends not only on the lagg members having link, but also
on the lacp protocol converging to a distributing state with the
link partner.

If we prematurely mark the link as up, then we will send a
gratuitous arp (via arp_handle_ifllchange()) before the lacp
interface is capable of passing traffic. When this happens,
the gratuitous arp is lost, and our link partner may cache
a stale mac address (eg, when the base mac address for the
lagg bundle changes, due to a BIOS change re-ordering NIC
unit numbers)

Reviewed by: jtl, hselasky
Sponsored by: Netflix


# 5f901c92 24-Jul-2018 Andrew Turner <andrew@FreeBSD.org>

Use the new VNET_DEFINE_STATIC macro when we are defining static VNET
variables.

Reviewed by: bz
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16147


# 6573d758 03-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

epoch(9): allow preemptible epochs to compose

- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066


# 0f8d79d9 24-May-2018 Matt Macy <mmacy@FreeBSD.org>

CK: update consumers to use CK macros across the board

r334189 changed the fields to have names distinct from those in queue.h
in order to expose the oversights as compile time errors


# db5a36bd 22-May-2018 Mark Johnston <markj@FreeBSD.org>

Simplify lagg_input().

No functional change intended.

MFC after: 2 weeks


# 46d0f824 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

net: fix set but not used


# d7c5a620 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

ifnet: Replace if_addr_lock rwlock with epoch + mutex

Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host is receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps InGbs OGbs err     TCP Est %CPU  syscalls csw     irq  GBfree
4.98   0.00  4.42  0.00 4235592 33      83.80 4720653  2149771 1235 247.32
4.73   0.00  4.20  0.00 4025260 33      82.99 4724900  2139833 1204 247.32
4.72   0.00  4.20  0.00 4035252 33      82.14 4719162  2132023 1264 247.32
4.71   0.00  4.21  0.00 4073206 33      83.68 4744973  2123317 1347 247.32
4.72   0.00  4.21  0.00 4061118 33      80.82 4713615  2188091 1490 247.32
4.72   0.00  4.21  0.00 4051675 33      85.29 4727399  2109011 1205 247.32
4.73   0.00  4.21  0.00 4039056 33      84.65 4724735  2102603 1053 247.32

After the patch

InMpps OMpps InGbs OGbs err     TCP Est %CPU  syscalls csw     irq  GBfree
5.43   0.00  4.20  0.00 3313143 33      84.96 5434214  1900162 2656 245.51
5.43   0.00  4.20  0.00 3308527 33      85.24 5439695  1809382 2521 245.51
5.42   0.00  4.19  0.00 3316778 33      87.54 5416028  1805835 2256 245.51
5.42   0.00  4.19  0.00 3317673 33      90.44 5426044  1763056 2332 245.51
5.42   0.00  4.19  0.00 3314839 33      88.11 5435732  1792218 2499 245.52
5.44   0.00  4.19  0.00 3293228 33      91.84 5426301  1668597 2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15366


# 70398c2f 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

epoch(9): Make epochs non-preemptible by default

There are risks associated with waiting on a preemptible epoch section.
Change the name to make them not be the default and document the issue
under CAVEATS.

Reported by: markj


# 99031b8f 14-May-2018 Stephen Hurd <shurd@FreeBSD.org>

Replace rmlock with epoch in lagg

Use the new epoch based reclamation API. Now the hot paths will not
block at all, and the sx lock is used for the softc data. This fixes LORs
reported where the rwlock was obtained when the sxlock was held.

Submitted by: mmacy
Reported by: Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15355


# fd3bb7aa 04-Jan-2018 Steven Hartland <smh@FreeBSD.org>

Disabled the use of flowid for lagg by default

Disabled the use of RSS hash from the network card aka flowid for
lagg(4) interfaces by default as it's currently incompatible with
the lacp and loadbalance protocols.

The incompatibility is due to the fact that the flowid isn't known
for the first packet of a new outbound stream which can result in
the hash calculation method changing and hence a stream being
incorrectly split across multiple interfaces during normal
operation.

This can be re-enabled by setting the following in loader.conf:
net.link.lagg.default_use_flowid="1"

Discussed with: kmacy
Sponsored by: Multiplay


# 51352d9d 25-Jul-2017 Sean Bruno <sbruno@FreeBSD.org>

Don't hold the RM lock during lagg_proto_addport() to avoid an LOR.

Submitted by: Kevin Bowling <kevin.bowling@kev009.com>
Reviewed by: mav
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11711


# 41cf0d54 26-May-2017 Alexander Motin <mav@FreeBSD.org>

Call VLAN_CAPABILITIES() when LAGG capabilities change.

This makes a VLAN on top of a LAGG expose proper capabilities if they are
changed after creation.

MFC after: 1 week


# 8403ab79 26-May-2017 Alexander Motin <mav@FreeBSD.org>

Improve applying unified capabilities to the lagg ports.

Some NICs have interdependent capabilities, so that disabling one requires
disabling another (TXCSUM/RXCSUM on em). This code tries to reach
consensus more insistently.

PR: 219453
MFC after: 1 week


# e3d90506 25-May-2017 Alexander Motin <mav@FreeBSD.org>

Remove some code, dead from day one.


# bbfc32a6 05-May-2017 Alexander Motin <mav@FreeBSD.org>

Relax r317696 locking to not drain taskqueue under the lock.

MFC after: 11 days


# e83177fb 02-May-2017 Alexander Motin <mav@FreeBSD.org>

Fix r317696 build without debug.

MFC after: 2 weeks


# 2f86d4b0 02-May-2017 Alexander Motin <mav@FreeBSD.org>

Introduce sleepable locks into if_lagg.

Before this change if_lagg was using nonsleepable rmlocks to protect its
internal state. This patch introduces another sx lock to protect code
paths that require sleeping, while still using the old rmlock to protect hot
nonsleepable data paths.

This change allows removing the taskqueue decoupling previously used to change
interface addresses without holding the lock. Instead it uses sx lock to
protect direct if_ioctl() calls.

As another bonus, the new code synchronizes the enabled capabilities of member
interfaces and allows controlling them with ifconfig laggX, which was
impossible before. This part should fix interoperation with if_bridge,
which may need to disable some capabilities, such as TXCSUM or LRO, to allow
bridging with noncapable interfaces.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D10514


# 1e04441a 22-Apr-2017 Alexander Motin <mav@FreeBSD.org>

Remove unneeded conditions.

MFC after: 2 weeks


# b98b5ae8 21-Apr-2017 Alexander Motin <mav@FreeBSD.org>

Add interface reference counting to if_lagg.

Using plain ifunit() looks like asking for trouble.

MFC after: 2 weeks


# 13157b2b 29-Jan-2017 Luiz Otavio O Souza <loos@FreeBSD.org>

Do not update the lagg link layer address when destroying a lagg clone.

This would enqueue an event to send the gratuitous arp on a dying lagg
interface without any physical ports attached to it.

Apart from that, the taskqueue_drain() on lagg_clone_destroy() runs too
late, when the ifp data structure is already freed. Fix that too.

Obtained from: pfSense
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (Netgate)


# f3e7afe2 18-Jan-2017 Hans Petter Selasky <hselasky@FreeBSD.org>

Implement kernel support for hardware rate limited sockets.

- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API supports rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon the next ip_output(), allocation of a new "snd_tag" will be attempted.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months


# 8a73c85d 20-Dec-2016 Alan Somers <asomers@FreeBSD.org>

Remove stray debugging code from r310180

Reported by: rstone
Pointy hat to: asomers
MFC after: 3 weeks
X-MFC-with: 310180
Sponsored by: Spectra Logic Corp


# d9fa2d67 16-Dec-2016 Alan Somers <asomers@FreeBSD.org>

Fix panic during lagg destruction with simultaneous status check

If you run "ifconfig lagg0 destroy" and "ifconfig lagg0" at the same time a
page fault may result. The first process will destroy ifp->if_lagg in
lagg_clone_destroy (called by if_clone_destroy). Then the second process
will observe that ifp->if_lagg is NULL at the top of lagg_port_ioctl and
goto fallback: where it will promptly dereference ifp->if_lagg anyway.

The solution is to repeat the NULL check for ifp->if_lagg

MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D8512


# 89856f7e 21-Jun-2016 Bjoern A. Zeeb <bz@FreeBSD.org>

Get closer to a VIMAGE network stack teardown from top to bottom rather
than removing the network interfaces first. This change is rather large
and convoluted as the ordering requirements cannot be separated.

Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.

Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.

For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along with (or before) the higher layer protocol when it is shut down.

Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.

For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).

Provide or enhance helper functions to do proper cleanup at a protocol
rather than at an interface level.

Approved by: re (hrs)
Obtained from: projects/vnet
Reviewed by: gnn, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6747


# a4641f4e 03-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/net*: minor spelling fixes.

No functional change.


# 729a4cff 05-Apr-2016 Ravi Pokala <rpokala@FreeBSD.org>

Revert accidental submit of WIP as part of r297609

Pointyhat to: rpokala


# 06152bf0 05-Apr-2016 Ravi Pokala <rpokala@FreeBSD.org>

Storage Controller Interface driver - typo in unimplemented macro in
scic_sds_controller_registers.h

s/contoller/controller/

PR: 207336
Submitted by: Tony Narlock <tony @ git-pull.com>


# d931334b 18-Feb-2016 Marcelo Araujo <araujo@FreeBSD.org>

Fix regression introduced in r272446.

lagg(4) supports the protocol none, where it disables any traffic without
disabling the lagg(4) interface itself.

PR: 206921
Submitted by: Pushkar Kothavade <pushkarbk@gmail.com>
Reviewed by: rpokala
Approved by: bapt (mentor)
MFC after: 3 weeks
Sponsored by: gandi.net
Differential Revision: https://reviews.freebsd.org/D5076


# d62edc5e 22-Jan-2016 Marcelo Araujo <araujo@FreeBSD.org>

Add an IOCTL rr_limit to let users fine-tune the number of packets to be
sent using the roundrobin protocol and achieve better granularity and distribution
among the interfaces. Tuning the number of packets sent per interface can
increase throughput and reduce out-of-order packets as well as SACKs.

Example of usage:
# ifconfig bge0 up
# ifconfig bge1 up
# ifconfig lagg0 create
# ifconfig lagg0 laggproto roundrobin laggport bge0 laggport bge1 \
192.168.1.1 netmask 255.255.255.0
# ifconfig lagg0 rr_limit 500
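
One way to picture the effect of rr_limit is the sketch below, which sends
that many consecutive packets on a port before advancing to the next one.
This is an illustration under assumptions, not the driver's code: rr_sketch
and rr_next_port are hypothetical names, and treating a limit of 0 as 1 is a
choice made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

struct rr_sketch {
	uint32_t seq;		/* packets sent so far */
	uint32_t limit;		/* packets per port before advancing */
	uint32_t nports;	/* number of ports; must be non-zero */
};

/* Returns the index of the port that carries the next packet. */
static uint32_t
rr_next_port(struct rr_sketch *rr)
{
	uint32_t limit = rr->limit != 0 ? rr->limit : 1;
	uint32_t port = (rr->seq / limit) % rr->nports;

	rr->seq++;
	return (port);
}
```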

Reviewed by: thompsa, glebius, adrian (old patch)
Approved by: bapt (mentor)
Relnotes: Yes
Differential Revision: https://reviews.freebsd.org/D540


# d6e82913 17-Dec-2015 Steven Hartland <smh@FreeBSD.org>

Revert r292275 & r292379

glebius has concerns about these changes so reverting those can be discussed
and addressed.

Sponsored by: Multiplay


# 52e53e2d 15-Dec-2015 Steven Hartland <smh@FreeBSD.org>

Fix lagg failover due to missing notifications

When using lagg failover mode neither Gratuitous ARP (IPv4) nor Unsolicited
Neighbour Advertisements (IPv6) are sent to notify other nodes that the
address may have moved.

This results in slow failover, dropped packets and network outages for the
lagg interface when the primary link goes down.

We now use the new if_link_state_change_cond with the force param set to
allow lagg to force through link state changes and hence fire
ifnet_link_event events, which are now monitored by rip and nd6.

Upon receiving these events each protocol triggers the relevant
notifications:
* inet4 => Gratuitous ARP
* inet6 => Unsolicited Neighbour Announce

This also fixes the carp IPv6 NA's that stopped working after r251584 which
added the ipv6_route__llma route.

The new behaviour can be controlled using the sysctls:
* net.link.ether.inet.arp_on_link
* net.inet6.icmp6.nd6_on_link

Also removed unused param from lagg_port_state and added descriptions for the
sysctls while here.

PR: 156226
MFC after: 1 month
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D4111


# 8ad43f2d 14-Nov-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Move iflladdr_event eventhandler invocation to if_setlladdr.

Suggested by: glebius


# bb3d23fd 01-Nov-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix lladdr change propagation for vlans on top of it.
Fix lladdr update when setting mac address manually.
Fix lladdr_event for slave ports addition.

MFC after: 4 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D4004


# b7a581ea 15-Oct-2015 Hiroki Sato <hrs@FreeBSD.org>

Fix a panic when destroying a lagg interface.

Differential Revision: https://reviews.freebsd.org/D3883


# 023d10cb 07-Oct-2015 Hiroki Sato <hrs@FreeBSD.org>

Fix a bug that caused reinitialization failure of MAC addresses on
the lagg interface when removing the primary port.

PR: 201916
Differential Revision: https://reviews.freebsd.org/D3301


# 973532fc 04-Oct-2015 Marcelo Araujo <araujo@FreeBSD.org>

Completely remove the fec aggregation protocol.
The removal began with revision r271733.

NOTE: This patch must never be merged to 10-Stable

Reviewed by: glebius
Approved by: bapt (mentor)
Relnotes: Yes
Sponsored by: EuroBSDCon Sweden.
Differential Revision: D3786


# 0e02b43a 12-Aug-2015 Hiren Panchasara <hiren@FreeBSD.org>

Make LAG LACP fast timeout tunable through IOCTL.

Differential Revision: D3300
Submitted by: LN Sundararajan <lakshmi.n at msystechnologies>
Reviewed by: wblock, smh, gnn, hiren, rpokala at panasas
MFC after: 2 weeks
Sponsored by: Panasas


# a4b65afc 26-Mar-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Fix a possible mbuf leak on interface departure.

Reported by: Alexandre Martins


# b7ba031f 11-Mar-2015 Hans Petter Selasky <hselasky@FreeBSD.org>

Factor out mbuf hashing code from LAGG driver so that other network
drivers can use it. This avoids some code duplication. Add missing
default case to all switch statements while at it. Also move the
hashing of the IPv6 flow field to layer 4 because the IPv6 flow field
is constant on a per L4 connection basis and not on a per L3 network.

Differential Revision: https://reviews.freebsd.org/D1987
Sponsored by: Mellanox Technologies
MFC after: 1 month


# 0e5f55bb 22-Jan-2015 Will Andrews <will@FreeBSD.org>

Improve the distribution of LAGG port traffic.

I edited the original change to retain the use of arc4random() as a seed for
the hashing as a very basic defense against intentional lagg port selection.

The author's original commit message (edited slightly):

sys/net/ieee8023ad_lacp.c
sys/net/if_lagg.c
In lagg_hashmbuf, use the FNV hash instead of the old
hash32_buf. The hash32 family of functions operate one octet
at a time, and when run on a string s of length n, their output
is equivalent to :

( seed^n + SUM(i=0..n-1) 33^(n-i-1) * s[i] ) % 2^32

The problem is that the last five bytes of input don't get
multiplied by sufficiently many powers of 33 to rollover 2^32.
That means that changing the last few bytes (but obviously not
the very last) of input will always change the value of the
hash by a multiple of 33. In the case of lagg_hashmbuf() with
ipv4 input, the last four bytes are the TCP or UDP port
numbers. Since the output of lagg_hashmbuf is always taken
modulo the port count, and 3 is a common port count for a lagg,
that's bad. It means that the UDP or TCP source port will
never affect which lagg member is selected on a 3-port lagg.

At 10Gbps, I was not able to measure any difference in CPU
consumption between the old and new hash.
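
For reference, a 32-bit FNV-1 hash has the general shape sketched below:
multiply by the FNV prime, then XOR in the next octet. This is a generic
sketch, not the kernel's routine. The names fnv1_32 and SK_FNV_32_PRIME are
hypothetical, and the starting value is passed in by the caller so that a
random seed can be used, as the message above describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_FNV_32_PRIME 0x01000193u	/* standard 32-bit FNV prime */

/* FNV-1 over len octets of buf, starting from hval. */
static uint32_t
fnv1_32(const void *buf, size_t len, uint32_t hval)
{
	const uint8_t *s = buf;

	while (len-- != 0) {
		hval *= SK_FNV_32_PRIME;	/* wraps modulo 2^32 */
		hval ^= *s++;
	}
	return (hval);
}
```

A port would then be picked by taking the result modulo the port count.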

Submitted by: asomers (original commit)
Reviewed by: emaste, glebius
MFC after: 1 week
Sponsored by: Spectra Logic
MFSpectraBSD: 1001723 on 2013/08/28 (original)
1114258 on 2015/01/22 (edit)


# 504289ea 17-Jan-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Fix condition and really sort ports. Also add comment describing
the intent of this code.

Reported by: sbruno
MFC after: 1 week
Sponsored by: Yandex LLC


# c2529042 01-Dec-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but no longer has any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after: 1 month
Sponsored by: Mellanox Technologies


# 033074c4 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Replace 'struct route *' if_output() argument with 'struct nhop_info *'.
Leave 'struct route' as is for legacy routing api users.
Remove most of rtalloc_ign*-derived functions.


# bf6d3f0c 17-Oct-2014 Hiroki Sato <hrs@FreeBSD.org>

- Fix lladdr configuration which could prevent LACP mode from working.
- Fix LORs when a laggport interface has an IPv6 LLA.

PR: 194321


# 6d478167 04-Oct-2014 Hiroki Sato <hrs@FreeBSD.org>

- Move L2 addr configuration for the primary port to a taskqueue. This fixes
LOR of softc rmlock in iflladdr_event handlers.

- Call if_delmulti_ifma() after LACP_UNLOCK(). This fixes another LOR.

- Fix a panic in lacp_transit_expire().

- Fix a panic in lagg_input() upon shutting down a port.


# 9732189c 02-Oct-2014 Hiroki Sato <hrs@FreeBSD.org>

Separate option handling from SIOC[SG]LAGG to SIOC[SG]LAGGOPTS for
backward compatibility with old ifconfig(8).


# 939a050a 01-Oct-2014 Hiroki Sato <hrs@FreeBSD.org>

Virtualize lagg(4) cloner. This change fixes a panic when tearing down
if_lagg(4) interfaces which were cloned in a vnet jail.

Sysctl nodes which are dynamically generated for each cloned interface
(net.link.lagg.N.*) have been removed, and use_flowid and flowid_shift
ifconfig(8) parameters have been added instead. Flags and per-interface
statistics counters are displayed in "ifconfig -v".

CR: D842


# dee826ce 01-Oct-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Fix off by one in lagg_port_destroy().

Reported by: "Max N. Boyarov" <zotrix bsd.by>


# 112f50ff 28-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Finally, convert counters in struct ifnet to counter(9).

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 23575437 28-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Convert to if_inc_counter() last remnants of bare access to struct ifnet
counters.


# 7d6cc45c 27-Sep-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Use underlying ports counters to get lagg statistics instead of
per-packet accounting.
This introduces user-visible changes like aggregating error counters.

Reviewed by: asomers (prev.version), glebius
CR: D781
MFC after: 2 weeks
Sponsored by: Yandex LLC


# eade13f9 26-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove macros that hide access to struct ifnet fields.


# 38738d73 25-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Make all lagg protocol methods live in lagg_protos, not in softc. All
interfaces of the same protocol use the same methods.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 30e5de48 25-Sep-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Keep list of lagg ports sorted by if_index.

Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC


# 6900d0d3 25-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

- Whitespace.
- Remove caddr_t.


# 16ca790e 26-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

- Provide lagg_proto_attach(), lagg_proto_detach().
- Make detach a protocol method in lagg_protos.
- Simplify code to lookup protocols.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 09c7577e 26-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

- When reconfiguring protocol on a lagg, first set it to LAGG_PROTO_NONE,
then drop lock, run the attach routines, and then set it to specific
proto. This removes tons of WITNESS warnings.
- Make lagg protocol attach handlers non-failing and allocate memory
with M_WAITOK.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# b1bbc5b3 26-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Make lagg protocol detach methods return void.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 9fd573c3 22-Sep-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Improve transmit sending offload, TSO, algorithm in general.

The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware are limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot be loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

Reviewed by: adrian, rmacklem
Sponsored by: Mellanox Technologies
MFC after: 1 week


# 99cdd961 17-Sep-2014 Marcelo Araujo <araujo@FreeBSD.org>

Add laggproto broadcast, which sends frames to all ports of the lagg(4) group
and receives frames on any port of the lagg(4).

Phabric: D549
Reviewed by: glebius, thompsa
Approved by: glebius
Obtained from: OpenBSD
Sponsored by: QNAP Systems Inc.


# 72f31000 13-Sep-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Revert r271504. A new patch to solve this issue will be made.

Suggested by: adrian @


# eb93b77a 13-Sep-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Improve the TCP segmentation offload, TSO, algorithm in general.

The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware are limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Otherwise some kinds of hardware might have to drop completely valid
mbuf chains because they cannot be loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible, following input from other FreeBSD developers, and will use
defaults for values not set.

MFC after: 1 week
Sponsored by: Mellanox Technologies


# 13399157 10-Aug-2014 Marcelo Araujo <araujo@FreeBSD.org>

- Remove unneeded include.

Phabric: D563
Reviewed by: kevlo
Approved by: kevlo


# 2d222cb7 03-Aug-2014 Alexander Motin <mav@FreeBSD.org>

Improve locking of multicast addresses in VLAN and LAGG interfaces.

This fixes several scenarios of reproducible panics, caused by races
between multicast address changes and interface destruction.

MFC after: 2 weeks


# af3b2549 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 37a107a4 27-Jun-2014 Glen Barber <gjb@FreeBSD.org>

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 3da1cf1e 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating children of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, since some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading kludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, since malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# d092e11c 15-Apr-2014 Rick Macklem <rmacklem@FreeBSD.org>

Fix build for non-INET that was broken by r264469.

MFC after: 2 weeks


# 9387570f 14-Apr-2014 Rick Macklem <rmacklem@FreeBSD.org>

Lagg did not set the value of if_hw_tsomax, so when lagg
was stacked on top of network interfaces that set if_hw_tsomax,
tcp_output() would see the default value instead of the value
set by the network interface(s). This patch modifies lagg so that
it sets if_hw_tsomax to the minimum of the value(s) for the
underlying network interfaces.

Reviewed by: glebius
MFC after: 2 weeks
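
Illustration (not part of the commit): a minimal userland sketch of the
"minimum over the underlying interfaces" rule described above. The function
name, the array argument, the dflt parameter, and the convention that 0 means
"not set by the driver" are assumptions made for this example and are not
taken from if_lagg.c.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the smallest TSO limit advertised by any
 * port.  A value of 0 is treated as "not set" and skipped (an assumption
 * of this sketch).  With no usable values, the default is returned.
 */
static uint32_t
min_tsomax(const uint32_t *port_tsomax, size_t nports, uint32_t dflt)
{
	uint32_t result = dflt;

	for (size_t i = 0; i < nports; i++) {
		if (port_tsomax[i] != 0 && port_tsomax[i] < result)
			result = port_tsomax[i];
	}
	return (result);
}
```

Taking the minimum means the stacked interface never promises the stack a
larger TSO burst than its most restrictive member can accept.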


# 95fbe4d0 18-Jan-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify filling sockaddr_dl structure for if_resolvemulti()
callback providers. link_init_sdl() function can be used to
fill most of the parameters. Use caller stack instead of
allocating / freeing memory for each request. Do not drop support
for extra-long (probably non-existing) link-layer protocols by
introducing link_alloc_sdl() (used by if_resolvemulti() callback)
and link_free_sdl() (used by caller).
Since this change breaks KBI, MFC requires slightly different approach
(link_init_sdl() auto-allocating buffer if necessary to handle cases
with unmodified if_resolvemulti() callers).

MFC after: 2 weeks


# 1a8959da 29-Dec-2013 Scott Long <scottl@FreeBSD.org>

Multi-queue NIC drivers and multi-port lagg tend to use the same lower
bits of the flowid as each other, resulting in a poor distribution of
packets among queues in certain cases. Work around this by adding a
set of sysctls for controlling a bit-shift on the flowid when doing
multi-port aggregation in lagg and lacp. By default, lagg/lacp will
now use bits 16 and higher instead of 0 and higher.

Reviewed by: max
Obtained from: Netflix
MFC after: 3 days
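
Illustration (not part of the commit): a small sketch of why shifting the
flowid helps. If the NIC already used the low bits to pick a receive queue,
every packet seen on one queue shares those bits, so hashing on them again
sends all of that queue's flows to the same port. The helper name and its
parameters are hypothetical and do not come from the driver.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the driver's code: map a flow id to a port
 * index, discarding the low `shift` bits first.  Returns -1 on bad input.
 */
static int
pick_port(uint32_t flowid, unsigned shift, unsigned nports)
{
	if (shift >= 32 || nports == 0)
		return (-1);
	return ((int)((flowid >> shift) % nports));
}
```

With four ports and a shift of 0, the flow ids 0x00010004 and 0x00030004 both
land on port 0; with a shift of 16 they land on ports 1 and 3.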


# 4cdc1f54 09-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

There are some high performance NICs that count statistics in hardware,
and there are ifnets, that do that via counter(9). Provide a flag that
would skip cache line thrashing '+=' operation in ether_input().

Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Reviewed by: melifaro, adrian
Approved by: re (marius)


# 310915a4 29-Aug-2013 Adrian Chadd <adrian@FreeBSD.org>

Convert the if_lagg rwlock to an rmlock.

We've been seeing lots of cache line contention (but not lock contention!)
in our workloads between the various TX and RX threads going on.

The write lock is only grabbed when configuration changes are made - which
are infrequent.

With this patch, the contention and cycles spent waiting for updates
disappear.

Sponsored by: Netflix, Inc.


# 49de4f22 26-Jul-2013 Adrian Chadd <adrian@FreeBSD.org>

Break out the static, global LACP debug options into a per-lagg unit
sysctl tree.

* Create a net.link.lagg.X.lacp node
* Add a debug node under that for tx_test and rx_test
* Add lacp_strict_mode, defaulting to 1

tx_test and rx_test are still a bitmap of unit numbers for now.
At some point it would be nice to create child nodes of the lagg bundle
for each sub-interface, and then populate those with various knobs
and statistics.

Sponsored by: Netflix


# 31402c27 12-Jul-2013 Adrian Chadd <adrian@FreeBSD.org>

Bring over some link aggregation / LACP protocol improvements and debugging
additions.

* Add some new tracing events to aid in debugging.
* Add in a debugging mode to drop transmit and received frames, specifically
to test whether seeing or hearing heartbeats correctly cause LACP to
drop the port.
* Add in (and make default) a strict LACP mode, which requires the
heartbeat on a port to be heard before it's used. Sometimes vendor ports
will hang but the link layer stays up, resulting in hung traffic.
* Add logging the number of link status flaps, again to aid in debugging
badly behaving switch ports.
* Calculate the lagg interface port speed as the multiple of the
configured ports, rather than the largest.

Obtained from: Netflix
MFC after: 2 weeks


# af805644 02-Jul-2013 Hiroki Sato <hrs@FreeBSD.org>

- Allow ND6_IFF_AUTO_LINKLOCAL for IFT_BRIDGE. An interface with IFT_BRIDGE
is initialized with !ND6_IFF_AUTO_LINKLOCAL && !ND6_IFF_ACCEPT_RTADV
regardless of net.inet6.ip6.accept_rtadv and net.inet6.ip6.auto_linklocal.
To configure an autoconfigured link-local address (RFC 4862), the
following rc.conf(5) configuration can be used:

ifconfig_bridge0_ipv6="inet6 auto_linklocal"

- if_bridge(4) now removes IPv6 addresses on a member interface to be
added when the parent interface or one of the existing member
interfaces has an IPv6 address. if_bridge(4) merges each link-local
scope zone which the member interfaces form respectively, so it causes
address scope violation. Removal of the IPv6 addresses prevents it.

- if_lagg(4) now removes IPv6 addresses on member interfaces
unconditionally.

- Set reasonable flags to non-IPv6-capable interfaces. [*]

Submitted by: rpaulo [*]
MFC after: 1 week


# eda6cf02 17-Jun-2013 Xin LI <delphij@FreeBSD.org>

Return ENETDOWN instead of ENOENT when all lagg(4) links are
inactive when upper layer tries to transmit packet. This
gives better feedback and meaningful errors for applications.

MFC after: 2 weeks
Reviewed by: thompsa


# f8afe337 07-Jun-2013 Mikolaj Golub <trociny@FreeBSD.org>

Properly set curvnet context in lagg_port_setlladdr() task handler.

Reported by: Nikos Vassiliadis <nvass gmx.com>
Submitted by: zec
Tested by: Nikos Vassiliadis <nvass gmx.com>
MFC after: 1 week


# 47e8d432 25-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Add const qualifier to the dst parameter of the ifnet if_output method.


# b64478a1 15-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Switch lagg(4) statistics to counter(9).

The lagg(4) is often used to bond high speed links, so basic per-packet +=
on statistics causes cache misses and statistics loss.

The perfect solution would be to convert ifnet(9) to counter(9), but this
requires much more work, and unfortunately an ABI change, so temporarily
patch lagg(4) manually.

We store counters in the softc, and once per second push their values
to legacy ifnet counters.

Sponsored by: Nginx, Inc.


# 209dddb9 22-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Remove __FreeBSD_version ifdefs.


# 1d9797f1 21-Jan-2013 Gleb Smirnoff <glebius@FreeBSD.org>

If lagg(4) can't forward a packet due to underlying port problems,
return much more meaningful ENETDOWN to the stack, instead of EBUSY.


# 6684e469 17-Oct-2012 Xin LI <delphij@FreeBSD.org>

Fix build.


# e9bbb44e 16-Oct-2012 Maksim Yevmenkin <emax@FreeBSD.org>

report total number of ports for each lagg(4) interface
via net.link.lagg.X.count sysctl

MFC after: 1 week


# 42a58907 16-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Make the "struct if_clone" opaque to users of the cloning API. Users
now use function calls:

if_clone_simple()
if_clone_advanced()

to initialize a cloner, instead of macros that initialize if_clone
structure.

Discussed with: brooks, bz, 1 year ago


# 9823d527 10-Oct-2012 Kevin Lo <kevlo@FreeBSD.org>

Revert previous commit...

Pointyhat to: kevlo (myself)


# a10cee30 09-Oct-2012 Kevin Lo <kevlo@FreeBSD.org>

Prefer NULL over 0 for pointers


# 3b7d677b 20-Sep-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Convert lagg(4) to use if_transmit instead of if_start.

In collaboration with: thompsa, sbruno, fabient


# 61587a84 30-Jun-2012 Andrew Thompson <thompsa@FreeBSD.org>

Add the same check as vlan(4) where we ignore the ifnet departure event if the
interface is just being renamed.

PR: kern/169557
Submitted by: Mark Johnston
MFC after: 3 days


# f74d5a7a 27-May-2012 Eygene Ryabinkin <rea@FreeBSD.org>

if_lagg: allow to invoke SIOCSLAGGPORT multiple times in a row

Currently, 'ifconfig laggX down' does not remove members from this
lagg(4) interface. So, 'service netif stop laggX' followed by
'service netif start laggX' will choke, because "stop" will leave
interfaces attached to the laggX and ifconfig from the "start" will
refuse to add already-existing interfaces.

The real-world case is when I am bundling together my Ethernet and
WiFi interfaces and using multiple profiles for accessing network in
different places: system being booted up with one profile, but later
this profile being exchanged to another one, followed by 'service
netif restart' will not add WiFi interface back to the lagg: the
"stop" action from 'service netif restart' will shut down my main WiFi
interface, so wlan0 that exists in the lagg0 will be destroyed and
purged from lagg0; the "start" action will try to re-add both
interfaces, but since Ethernet one is already in lagg0, ifconfig will
refuse to add the wlan0 from WiFi interface.

Adding the interface to the lagg(4) when it is already there
should be an idempotent action: we're really not changing anything,
so this fix doesn't change the semantics of interface addition.

Approved by: thompsa
Reviewed by: emaste
MFC after: 1 week


# 6107adc3 02-May-2012 Ed Maste <emaste@FreeBSD.org>

Relax restriction on direct tx to child ports

Lagg(4) restricts the type of packet that may be sent directly to a child
port, to avoid undesired output from accidental misconfiguration.
Previously only ETHERTYPE_PAE was permitted.

BPF writes to a lagg(4) child port are presumably intentional, so just
allow them, while still blocking other packets that should take the
aggregation path.

PR: kern/138620
Approved by: thompsa@


# b517176a 11-Apr-2012 Andrew Thompson <thompsa@FreeBSD.org>

Set the proto to LAGG_PROTO_NONE before calling the detach routine so packets
are discarded, this is an issue because lacp drops the lock which may allow
network threads to access freed memory. Expand the lock coverage so the
detach/attach happen atomically.

Submitted by: Andrew Boyer (earlier version)


# cd613b63 07-Mar-2012 Andrew Thompson <thompsa@FreeBSD.org>

Move the vlan buffer space into the union which also fixes an unused variable
warning with !INET & !INET6.

Spotted by: pluknet


# 86f67641 06-Mar-2012 Andrew Thompson <thompsa@FreeBSD.org>

Add the ability to set which packet layers are used for the load balance hash
calculation.


# 3122b912 23-Feb-2012 Andrew Thompson <thompsa@FreeBSD.org>

Add a sysctl/tunable default value for the use_flowid sysctl in r232008.


# 0bf97ae2 22-Feb-2012 Andrew Thompson <thompsa@FreeBSD.org>

Using the flowid in the mbuf assumes the network card is giving a good hash for
the traffic flow, this may not be the case giving poor traffic distribution.
Add a sysctl which allows us to fall back to our own flow hash code.

PR: kern/164901
Submitted by: Eugene Grosbein
MFC after: 1 week


# 4b22573a 11-Nov-2011 Brooks Davis <brooks@FreeBSD.org>

In r191367 the need for if_free_type() was removed and a new member
if_alloctype was used to store the original interface type. Take
advantage of this change by removing all existing uses of if_free_type()
in favor of if_free().

MFC after: 1 Month


# 6472ac3d 07-Nov-2011 Ed Schouten <ed@FreeBSD.org>

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# c94a66f8 01-Aug-2011 Sergey Kandaurov <pluknet@FreeBSD.org>

Add missing MODULE_VERSION() definition to protect against duplicating
module loads.

PR: kern/159345
Reported by: Eugene Grosbein <egrosbein att rdtc ru>
Tested by: Eugene Grosbein <egrosbein att rdtc ru>
Approved by: re (kib)
MFC after: 1 week


# 6069a2c0 07-Jul-2011 Andrew Thompson <thompsa@FreeBSD.org>

Grab the rlock before checking if our interface is enabled, it could be
possible to hit a dead pointer when changing interfaces.

PR: kern/156978
Submitted by: Andrew Boyer
MFC after: 1 week


# 627cecc5 30-Apr-2011 Andrew Thompson <thompsa@FreeBSD.org>

LACP frames must not be sent VLAN-tagged, check for that before processing.

PR: kern/156743
Submitted by: Dmitrij Tejblum
MFC after: 1 week


# a0ae8f04 27-Apr-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Make various (pseudo) interfaces compile without INET in the kernel
adding appropriate #ifdefs. For module builds the framework needs
adjustments for at least carp.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


# 5f82cfdf 04-Mar-2011 Ermal Luçi <eri@FreeBSD.org>

Fix a panic that can happen when trying to destroy a lagg(4) with scheduler set to none.

Approved by: thompsa(mentor)
MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# be4572c8 01-Sep-2010 Ed Maste <emaste@FreeBSD.org>

Add a sysctl knob to accept input packets on any link in a failover lagg.


# 7c61d493 24-May-2010 Andrew Thompson <thompsa@FreeBSD.org>

MFC r202588

Declare a new EVENTHANDLER called iflladdr_event which signals that the L2
address on an interface has changed. This lets stacked interfaces such as
vlan(4) detect that their lower interface has changed and adjust things in
order to keep working. Previously this situation broke at least vlan(4) and
lagg(4) configurations.

The EVENTHANDLER_INVOKE call was not placed within if_setlladdr() due to the
risk of a loop.

PR: kern/142927
Submitted by: Nikolay Denev

MFC r202611

Do not hold the lock over if_setlladdr() as it calls into the interface driver
init routine.


# e546195f 07-Apr-2010 Xin LI <delphij@FreeBSD.org>

MFC r204901

Remove the check for IFF_DRV_OACTIVE right before adding a port into lagg
interface. The check itself seems to be coming from OpenBSD but does not
seem to be useful for our code.

Discussed with: thomasa


# 13d85d43 08-Mar-2010 Xin LI <delphij@FreeBSD.org>

Remove the check for IFF_DRV_OACTIVE right before adding a port into lagg
interface. The check itself seems to be coming from OpenBSD but does not
seem to be useful for our code.

Discussed with: thomasa
MFC after: 1 month


# 644da90d 06-Feb-2010 Ermal Luçi <eri@FreeBSD.org>

Propagate the vlan events to the underlying interfaces/members so they can do initialization of hw related features.

PR: kern/141646
Reviewed by: thompsa
Approved by: thompsa(co-mentor)
MFC after: 2 weeks


# ea4ca115 18-Jan-2010 Andrew Thompson <thompsa@FreeBSD.org>

Declare a new EVENTHANDLER called iflladdr_event which signals that the L2
address on an interface has changed. This lets stacked interfaces such as
vlan(4) detect that their lower interface has changed and adjust things in
order to keep working. Previously this situation broke at least vlan(4) and
lagg(4) configurations.

The EVENTHANDLER_INVOKE call was not placed within if_setlladdr() due to the
risk of a loop.

PR: kern/142927
Submitted by: Nikolay Denev


# 22133b44 08-Jan-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Stop GCC from complaining about lagg_port_checkstacking() being unused.


# 5c6026e9 30-Apr-2009 Andrew Thompson <thompsa@FreeBSD.org>

Use the flowid if it's available for selecting the tx port.


# 279aa3d4 16-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

Change if_output to take a struct route as its fourth argument in order
to allow passing a cached struct llentry * down to L2

Reviewed by: rwatson


# f812e067 17-Dec-2008 Andrew Thompson <thompsa@FreeBSD.org>

- Protect against sc->sc_primary being null
- Initialise speed where it's used


# be07c180 17-Dec-2008 Andrew Thompson <thompsa@FreeBSD.org>

Update the interface baudrate taking into account the max speed for the
different aggregation protocols.


# 09efca80 16-Dec-2008 Andrew Thompson <thompsa@FreeBSD.org>

Also propagate the if_hwassist value to the parent so that cksum offload works.

Submitted by: Tom Hicks (thicks_averesys.com)


# aea78d20 22-Nov-2008 Kip Macy <kmacy@FreeBSD.org>

convert calls to IFQ_HANDOFF to if_transmit


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 8e46f311 30-Sep-2008 Gleb Smirnoff <glebius@FreeBSD.org>

Do not mangle if_oerrors of the underlying interface. This counter
belongs solely to the driver.
We don't lose any statistics with this change, because in an error
case the drop counter on the interface output queue is always incremented.

Reviewed by: thompsa


# 149bac03 18-Sep-2008 Andrew Thompson <thompsa@FreeBSD.org>

Move the protocol and port count checks to outside the loop, these conditions
can not change while we have the lock so no point retesting.


# 96c41c08 17-Sep-2008 Andrew Thompson <thompsa@FreeBSD.org>

Make sure there is at least one port to avoid divide by zero when choosing the
tx port.

PR: kern/122794
MFC after: 3 days


# 6729225f 03-Jul-2008 Andrew Thompson <thompsa@FreeBSD.org>

port % count will never be greater than LAGG_MAX_PORTS so nuke the test.


# 3de18008 16-Mar-2008 Andrew Thompson <thompsa@FreeBSD.org>

Switch the LACP state machine over to its own mutex to protect the internals,
this means that it no longer grabs the lagg rwlock. Use two port table arrays
which list the active ports for Tx and switch between them with an atomic op.
Now the lagg rwlock is only exclusively locked for management (ioctls) and
queuing of lacp control frames isn't needed.


# af0084c9 30-Dec-2007 Andrew Thompson <thompsa@FreeBSD.org>

Pass any unmatched slowprotocols frames up the stack instead of dropping them,
there are more subtypes than just LACP.


# 1f019d83 17-Dec-2007 Andrew Thompson <thompsa@FreeBSD.org>

- Use the macro to check the port status as it will also test if it's
administratively down (!IFF_UP)
- Use the same parameters to lagg_link_active() to get the backup port as in
the output path, this didn't actually matter in practice as sc_primary is
always the first on the port list.

MFC after: 3 days


# f51133ee 17-Dec-2007 Andrew Thompson <thompsa@FreeBSD.org>

Add myself to the copyright.


# d3b28963 04-Dec-2007 Andrew Thompson <thompsa@FreeBSD.org>

Support monitor mode where the frame is discarded after bpf and stats processing.


# 80ddfb40 24-Nov-2007 Andrew Thompson <thompsa@FreeBSD.org>

Have the lagg interface generate link up/down events, the interface is marked
as up if at least one of its ports also has a link up. This fixes using
carp+lagg together and any other system that relies on linkstate events.

PR: kern/113956
MFC after: 3 days


# 544f7141 19-Oct-2007 Andrew Thompson <thompsa@FreeBSD.org>

Use ETHER_BPF_MTAP so that the vlan tags are visible to bpf(4) when stacked
under a vlan.

MFC after: 3 days


# 960dab09 11-Oct-2007 Andrew Thompson <thompsa@FreeBSD.org>

Fix two panics in lagg.

1. The locking was changed to shared but roundrobin mode still updated a
pointer in the softc with the next tx interface to use. This will panic
under high load. Change this to an atomically incremented sequence number in
order to choose the tx port in round robin.

2. IFQ_HANDOFF will free the mbuf if the queue is full, this will then be freed
again by lagg_start() and panic. Reorganised the error handling and freeing
to fix this.

MFC after: 3 days
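
Illustration (not part of the commit): a userland C11 sketch of the first
fix, choosing the round-robin port from an atomically incremented sequence
number instead of a shared "next port" pointer. The struct and function names
are hypothetical; the kernel uses its own atomic(9) primitives rather than
<stdatomic.h>.

```c
#include <stdatomic.h>

/* Hypothetical per-lagg state: only a counter, no pointer to mutate. */
struct rr_state {
	atomic_uint seq;
};

/*
 * Return the index of the port to transmit on, or -1 if there are no
 * ports.  Concurrent callers each get a distinct sequence value, so no
 * exclusive lock is needed just to advance the round-robin position.
 */
static int
rr_next_port(struct rr_state *rr, unsigned nports)
{
	if (nports == 0)
		return (-1);	/* also avoids a modulo by zero */
	return ((int)(atomic_fetch_add(&rr->seq, 1) % nports));
}
```

Because the shared state is a single integer updated with one atomic
operation, senders holding only a shared lock cannot leave it pointing at a
port that has since been removed.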


# 20745551 30-Aug-2007 Andrew Thompson <thompsa@FreeBSD.org>

Show the ACTIVE flag in ifconfig for the single interface that is actually
active in failover mode rather than all interfaces with a link. This makes it
clear if the master interface is in use or one of the backup links.

Found by: Writing the Handbook section
Approved by: re (kensmith)


# de75afe6 30-Jul-2007 Andrew Thompson <thompsa@FreeBSD.org>

- Propagate the largest set of interface capabilities supported by all lagg
ports to the lagg interface.
- Use the MTU from the first interface as the lagg MTU, all extra interfaces
must be the same.

This fixes using a lagg interface for a vlan or enabling jumbo frames, etc.

Approved by: re (kensmith)
MFC After: 3 days
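
Illustration (not part of the commit): "the largest set of capabilities
supported by all ports" is the intersection of the ports' capability
bitmasks. The sketch below shows that computation in isolation; the function
name and the plain uint32_t masks are assumptions for the example, not the
driver's data structures.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: AND together every port's capability mask.  A bit
 * survives only if all ports advertise it.  With no ports, nothing is
 * advertised.
 */
static uint32_t
common_caps(const uint32_t *port_caps, size_t nports)
{
	uint32_t caps = ~(uint32_t)0;

	if (nports == 0)
		return (0);
	for (size_t i = 0; i < nports; i++)
		caps &= port_caps[i];
	return (caps);
}
```

For masks 0x7, 0x5 and 0xD the result is 0x5: only the bits present in every
mask remain.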


# 82056f42 26-Jul-2007 Andrew Thompson <thompsa@FreeBSD.org>

Avoid holding the softc lock when using copyout().

Reported by: dfr
Approved by: re (rwatson)


# b3d37ca5 05-Jul-2007 Andrew Thompson <thompsa@FreeBSD.org>

Allow the LACP state to be queried from userland which at the moment is the
actor and partner peer info. Print out the active aggregator and per port data
in verbose mode from ifconfig.

Approved by: re (mux)


# ec32b37e 12-Jun-2007 Andrew Thompson <thompsa@FreeBSD.org>

non-functional cleanup
- remove dead code
- use consistent variable names
- gc unused defines
- whitespace cleanup


# 6469e186 19-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

- packets on the input interface were counted twice
- Use IFQ_HANDOFF instead of rolling our own


# 9bbba41e 18-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

Fix an mbuf leak where sc_start fails or the protocol is none.


# 3362a474 18-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

Fix locking assert where we should hold the reader lock.


# e2a77bb8 15-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

Fix unused variable error with !INET6

Reported by: Artem Naluzhny, Frank Terhaar-Yonkers


# 7a04b0f6 15-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

Feed ipv6 flowlabel to hash calculation.

Obtained from: NetBSD


# 3bf517e3 15-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

Change from a mutex to a read/write lock. This allows the tx port to be
selected simultaneously by multiple senders and transmit/receive is not
serialised between aggregated interfaces.


# a5715cb2 07-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

- Correctly check if lp_ioctl is null
- Remove lagg_ether_purgemulti as it's no longer needed
- Mark the interface as up if any ports are active rather than just the primary


# efcd0965 06-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

The purgemulti call is not needed since all the ports have already been detached.


# cdc6f95f 06-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

Call if_setlladdr() on the aggregation port from a taskqueue so the softc lock
is not held. The short delay between aggregating the port and setting the MAC
address is fine.


# 108fe96a 06-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

Avoid touching various unsafe parts if the interface is disappearing.


# d74fd345 06-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

Change from using if_delmulti() to if_delmulti_ifma() as it simplifies the code
and is safe to use if the ifp has disappeared.

Suggested by: bms


# e3163ef6 03-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

- Add a disabled state for ports that can not be aggregated
- Refine check for lacp links, set to disabled if not suitable


# 139722d4 02-May-2007 Andrew Thompson <thompsa@FreeBSD.org>

Set the master flag on the right variable.


# 18242d3b 16-Apr-2007 Andrew Thompson <thompsa@FreeBSD.org>

Rename the trunk(4) driver to lagg(4) as it is too similar to vlan trunking.

The name trunk is misused as the networking term trunk means carrying multiple
VLANs over a single connection. The IEEE standard for link aggregation (802.3
section 3) does not talk about 'trunk' at all while it is used throughout IEEE
802.1Q in describing vlans.

The lagg(4) driver provides link aggregation, failover and fault tolerance.

Discussed on: current@