History log of /freebsd-current/sys/netinet/ip_output.c
Revision Date Author Comments
# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# e3ba0d6a 26-Jul-2023 Gleb Smirnoff <glebius@FreeBSD.org>

inpcb: do not copy so_options into inp_flags2

Since f71cb9f74808 socket stays connnected with inpcb through latter's
lifetime and there is no reason to complicate things and copy these
flags.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D41198


# bc310a95 20-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

ip output: ensure that mbufs are mapped if ipsec is enabled

Ipsec needs access to packet headers to determine if a policy is
applicable. It seems that typically IP headers are mapped, but the code
is arguably needs to check this before blindly accessing them. Then,
operations like m_unshare() and m_makespace() are not yet ready for
unmapped mbufs.

Ensure that the packet is mapped before calling into IPSEC_OUTPUT().

PR: 272616
Reviewed by: jhb, markj
Sponsored by: NVidia networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41112


# 185c1cdd 02-Jun-2023 Kristof Provost <kp@FreeBSD.org>

netinet: re-read IP length after PFIL hook

The pfil hook may modify the packet, so before we check its length (to
decide if it needs to be fragmented or not) we should re-read that
length.

This is most likely to happen when pf is reassembling packets. In that
scenario we'd receive the last fragment, which is likely to be a short
packet, pf would reassemble it (likely exceeding the interface MTU) and
then we'd transmit it without fragmenting, because we're comparing the
MTU to the length of the last fragment, not the fully reassembled
packet.

See also: https://redmine.pfsense.org/issues/14396
Reviewed by: cy
MFC after: 3 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D40395


# 317fa516 28-Feb-2023 Mark Johnston <markj@FreeBSD.org>

netinet: Remove the IP(V6)_RSS_LISTEN_BUCKET socket option

It has no effect, and an exp-run revealed that it is not in use.

PR: 261398 (exp-run)
Reviewed by: mjg, glebius
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38822


# 3aff4ccd 27-Feb-2023 Mark Johnston <markj@FreeBSD.org>

netinet: Remove IP(V6)_BINDMULTI

This option was added in commit 0a100a6f1ee5 but was never completed.
In particular, there is no logic to map flowids to different listening
sockets, so it accomplishes basically the same thing as SO_REUSEPORT.
Meanwhile, we've since added SO_REUSEPORT_LB, which at least tries to
balance among listening sockets using a hash of the 4-tuple and some
optional NUMA policy.

The option was never documented or completed, and an exp-run revealed
nothing using it in the ports tree. Moreover, it complicates the
already very complicated in_pcbbind_setup(), and the checking in
in_pcbbind_check_bindmulti() is insufficient. So, let's remove it.

PR: 261398 (exp-run)
Reviewed by: glebius
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38574


# a2256150 14-Feb-2023 Gleb Smirnoff <glebius@FreeBSD.org>

net: use pfil_mbuf_{in,out} where we always have an mbuf

This finalizes what has been started in 0b70e3e78b0.

Reviewed by: kp, mjg
Differential revision: https://reviews.freebsd.org/D37976


# 3d0d5b21 23-Jan-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Explicitly include <net/if_private.h> in netstack

Summary:
In preparation of making if_t completely opaque outside of the netstack,
explicitly include the header. <net/if_var.h> will stop including the
header in the future.

Sponsored by: Juniper Networks, Inc.
Reviewed by: glebius, melifaro
Differential Revision: https://reviews.freebsd.org/D38200


# da6715bb 14-Sep-2022 Gleb Smirnoff <glebius@FreeBSD.org>

ip_output: always increase "cantfrag" stat if ip_fragment() fails

While here, join two unlikely cases into one if clause.

Submitted by: Ivan Rozhuk <rozhuk.im gmail.com>
PR: 265718
Reviewed by: mjg, melifaro
Differential revision: https://reviews.freebsd.org/D36584


# 14c9a2db 02-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

net: retire PFIL_FWD

It is now unused and not having it allows further clean ups.

Reviewed by: cy, glebius, kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D36452


# e7d02be1 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: refactor protosw and domain static declaration and load

o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d18ef), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860dcd).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee54ea).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232


# 7d98cc09 01-Apr-2022 Andrey V. Elsukov <ae@FreeBSD.org>

Fix ipfw fwd that doesn't work in some cases

For IPv4 use dst pointer as destination address in fib4_lookup().
It keeps destination address from IPv4 header and can be changed
when PACKET_TAG_IPFORWARD tag was set by packet filter.

For IPv6 override destination address with address from dst_sa.sin6_addr,
that was set from PACKET_TAG_IPFORWARD tag.

Reviewed by: eugen
MFC after: 1 week
PR: 256828, 261697, 255705
Differential Revision: https://reviews.freebsd.org/D34732


# 77223d98 25-Jan-2022 Wojciech Macek <wma@FreeBSD.org>

ip_mroute: refactor epoch-basd locking

Remove duplicated epoch_enter and epoch_exit in IP inp/outp routines.
Remove unnecessary macros as well.

Obtained from: Semihalf
Spponsored by: Stormshield
Reviewed by: glebius
Differential revision: https://reviews.freebsd.org/D34030


# 9ba11796 27-Jan-2022 Andrew Gallatin <gallatin@FreeBSD.org>

Fix a memory leak when ip_output_send() returns EAGAIN due to send tag issues

When ip_output_send() returns EAGAIN due to issues with send tags (route
change, lagg failover, etc), it must free the mbuf. This is because
ip_output_send() was written as a wrapper/replacement for a direct
call to if_output(), and the contract with if_output() has
historically been that it owns the mbufs once called. When
ip_output_send() failed to free mbufs, it violated this assumption
and lead to leaked mbufs.

This was noticed when using NIC TLS in combination with hardware
rate-limited connections. When seeing lots of NIC output drops
triggered ratelimit send tag changes, we noticed we were leaking
ktls_sessions, send tags and mbufs. This was due ip_output_send()
leaking mbufs which held references to ktls_sessions, which in
turn held references to send tags.

Many thanks to jbh, rrs, hselasky and markj for their help in
debugging this.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34054
Reviewed by: hselasky, jhb, rrs
MFC after: 2 weeks


# 44775b16 24-Nov-2021 Mark Johnston <markj@FreeBSD.org>

netinet: Remove unneeded mb_unmapped_to_ext() calls

in_cksum_skip() now handles unmapped mbufs on platforms where they're
permitted.

Reviewed by: glebius, jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33097


# 756bb50b 16-Nov-2021 Mark Johnston <markj@FreeBSD.org>

sctp: Remove now-unneeded mb_unmapped_to_ext() calls

sctp_delayed_checksum() now handles unmapped mbufs, thanks to m_apply().

No functional change intended.

Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32942


# 2144431c 08-Oct-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Remove in_ifaddr_lock acquisiton to access in_ifaddrhead.

An IPv4 address is embedded into an ifaddr which is freed
via epoch. And the in_ifaddrhead is already a CK list. Use
the network epoch to protect against use after free.

Next step would be to CK-ify the in_addr hash and get rid of the...

Reviewed by: melifaro
Differential Revision: https://reviews.freebsd.org/D32434


# 62e1a437 22-Aug-2021 Zhenlei Huang <zlei.huang@gmail.com>

routing: Allow using IPv6 next-hops for IPv4 routes (RFC 5549).

Implement kernel support for RFC 5549/8950.

* Relax control plane restrictions and allow specifying IPv6 gateways
for IPv4 routes. This behavior is controlled by the
net.route.rib_route_ipv6_nexthop sysctl (on by default).

* Always pass final destination in ro->ro_dst in ip_forward().

* Use ro->ro_dst to exract packet family inside if_output() routines.
Consistently use RO_GET_FAMILY() macro to handle ro=NULL case.

* Pass extracted family to nd6_resolve() to get the LLE with proper encap.
It leverages recent lltable changes committed in c541bd368f86.

Presence of the functionality can be checked using ipv4_rfc5549_support feature(3).
Example usage:
route add -net 192.0.0.0/24 -inet6 fe80::5054:ff:fe14:e319%vtnet0

Differential Revision: https://reviews.freebsd.org/D30398
MFC after: 2 weeks


# 9748eb74 07-Aug-2021 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify nhop operations in ip_output().

Consistently use `nh` instead of always dereferencing
ro->ro_nh inside the if block.
Always use nexthop mtu, as it provides guarantee that mtu is accurate.
Pass `nh` pointer to rt_update_ro_flags() to allow upcoming uses
of updating ro flags based on different nexthop.

Differential Revision: https://reviews.freebsd.org/D31451
Reviewed by: kp
MFC after: 2 weeks


# 65634ae7 22-Apr-2021 Wojciech Macek <wma@FreeBSD.org>

mroute: fix race condition during mrouter shutting down

There is a race condition between V_ip_mrouter de-init
and ip_mforward handling. It might happen that mrouted
is cleaned up after V_ip_mrouter check and before
processing packet in ip_mforward.
Use epoch call aproach, similar to IPSec which also handles
such case.

Reported by: Damien Deville
Obtained from: Stormshield
Reviewed by: mw
Differential Revision: https://reviews.freebsd.org/D29946


# 3f43ada9 28-Jan-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Catch up with 6edfd179c86: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.

Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address. Then, this was extended to array of
such pointers. Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change. The new name should be generic enough to avoid further
renaming.


# 868aabb4 08-Oct-2020 Richard Scheffenegger <rscheff@FreeBSD.org>

Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow.

This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.

Note that for untagged traffic, explicitly adding a
priority will insert a special 801.1Q vlan header with
vlan ID = 0 to carry the priority setting

Reviewed by: gallatin, rrs
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26409


# fedeb08b 03-Oct-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Introduce scalable route multipath.

This change is based on the nexthop objects landed in D24232.

The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
relative weights and a dataplane-optimized structure to enable
efficient nexthop selection.

Simular to the nexthops, nexthop groups are immutable. Dataplane part
gets compiled during group creation and is basically an array of
nexthop pointers, compiled w.r.t their weights.

With this change, `rt_nhop` field of `struct rtentry` contains either
nexthop or nexthop group. They are distinguished by the presense of
NHF_MULTIPATH flag.
All dataplane lookup functions returns pointer to the nexthop object,
leaving nexhop groups details inside routing subsystem.

User-visible changes:

The change is intended to be backward-compatible: all non-mpath operations
should work as before with ROUTE_MPATH and net.route.multipath=1.

All routes now comes with weight, default weight is 1, maximum is 2^24-1.

Current maximum multipath group width is statically set to 64.
This will become sysctl-tunable in the followup changes.

Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1

route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20

netstat -6On

Nexthop groups data

Internet6:
GrpIdx NhIdx Weight Slots Gateway Netif Refcnt
1 ------- ------- ------- --------------------------------------- --------- 1
13 10 1 2001:db8::2 vlan2
14 20 2 2001:db8::3 vlan2

Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default

Tested by: olivier
Reviewed by: glebius
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D26449


# 2259a030 21-Sep-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Rework part of routing code to reduce difference to D26449.

* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(),
as these two can be applied at different entities and at different times.
* Start filling route weight in route change notifications
* Pass flowid to UDP/raw IP route lookups
* Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact
that rtentry can contain multiple nexthops.

Differential Revision: https://reviews.freebsd.org/D26497


# 374ce248 18-Sep-2020 Mitchell Horne <mhorne@FreeBSD.org>

Initialize some local variables earlier

Move the initialization of these variables to the beginning of their
respective functions.

On our end this creates a small amount of unneeded churn, as these
variables are properly initialized before their first use in all cases.
However, changing this benefits at least one downstream consumer
(NetApp) by allowing local and future modifications to these functions
to be made without worrying about where the initialization occurs.

Reviewed by: melifaro, rscheff
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26454


# b092fd6c 17-Sep-2020 Navdeep Parhar <np@FreeBSD.org>

if_vxlan(4): add support for hardware assisted checksumming, TSO, and RSS.

This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx
and rx), TSO, and RSS if the NIC is capable of performing these operations on
inner VXLAN traffic.

A VXLAN interface inherits the capabilities of its vxlandev interface if one is
specified or of the interface that hosts the vxlanlocal address. If other
interfaces will carry traffic for that VXLAN then they must have the same
hardware capabilities.

On transmit, if_vxlan verifies that the outbound interface has the required
capabilities and then translates the CSUM_ flags to their inner equivalents.
This tells the hardware ifnet that it needs to operate on the inner frame and
not the outer VXLAN headers.

An event is generated when a VXLAN ifnet starts. This allows hardware drivers to
configure their devices to expect VXLAN traffic on the specified incoming port.

On receive, the hardware does RSS and checksum verification on the inner frame.
if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is
not very clear why it didn't do this already.

Future work:
Rx: it should be possible to avoid the first trip up the protocol stack to get
the frame to if_vxlan just so it can decapsulate and requeue for a second trip
up the stack. The hardware NIC driver could directly call an if_vxlan receive
routine for VXLAN traffic instead.

Rx: LRO. depends on what happens with the previous item. There will have to to
be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.

Reviewed by: kib@
Relnotes: Yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25873


# 662c1305 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

net: clean up empty lines in .c and .h files


# 95033af9 18-Jun-2020 Mark Johnston <markj@FreeBSD.org>

Add the SCTP_SUPPORT kernel option.

This is in preparation for enabling a loadable SCTP stack. Analogous to
IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured
in order to support a loadable SCTP implementation.

Discussed with: tuexen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 3553b300 28-May-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Switch ip_output/icmp_reflect rt lookup calls with fib4_lookup.

fib4_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.

Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.

Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24976


# 174fb9db 17-May-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Remove redundant checks for nhop validity.
Currently NH_IS_VALID() simly aliases to RT_LINK_IS_UP(), so we're
checking the same thing twice.

In the near future the implementation of this check will be simpler,
as there are plans to introduce control-plane interface status monitoring
similar to ipfw interface tracker.


# 6043ac20 11-May-2020 Andrew Gallatin <gallatin@FreeBSD.org>

Ktls: never skip stamping tags for NIC TLS

The newer RACK and BBR TCP stacks have added a mechanism
to disable hardware packet pacing for TCP retransmits.
This mechanism works by skipping the send-tag stamp
on rate-limited connections when the TCP stack calls
ip_output() with the IP_NO_SND_TAG_RL flag set.

When doing NIC TLS, we must ignore this flag, as
NIC TLS packets must always be stamped. Failure
to stamp a NIC TLS packet will result in crypto
issues.

Reviewed by: hselasky, rrs
Sponsored by: Netflix, Mellanox


# 7b6c99d0 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf
within m_epg namespace.
All edits except the 'struct mbuf' declaration and mb_dupcl() were done
mechanically with sed:

s/->m_ext_pgs.nrdy/->m_epg_nrdy/g
s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g
s/->m_ext_pgs.trail_len/->m_epg_trllen/g
s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g
s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g
s/->m_ext_pgs.flags/->m_epg_flags/g
s/->m_ext_pgs.record_type/->m_epg_record_type/g
s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g
s/->m_ext_pgs.tls/->m_epg_tls/g
s/->m_ext_pgs.so/->m_epg_so/g
s/->m_ext_pgs.seqno/->m_epg_seqno/g
s/->m_ext_pgs.stailq/->m_epg_stailq/g

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 4043ee3c 28-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert rtalloc_mpath_fib() users to the new KPI.

New fib[46]_lookup() functions support multipath transparently.
Given that, switch the last rtalloc_mpath_fib() calls to
dib4_lookup() and eliminate the function itself.

Note: proper flowid generation (especially for the outbound traffic) is a
bigger topic and will be handled in a separate review.
This change leaves flowid generation intact.

Differential Revision: https://reviews.freebsd.org/D24595


# 983066f0 25-Apr-2020 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert route caching to nexthop caching.

This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
to perform packet forwarding such as gateway interface and mtu. Nexthops
are shared among the routes, providing more pre-computed cache-efficient
data while requiring less memory. Splitting the LPM code and the attached
data solves multiple long-standing problems in the routing layer,
drastically reduces the coupling with outher parts of the stack and allows
to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
for notably better performance for large TCP senders. Caching works by
acquiring rtentry reference, which is protected by per-rtentry mutex.
If the routing table is changed (checked by comparing the rtable generation id)
or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
cached object, which is mostly mechanical. Other moving parts like cache
cleanup on rtable change remains the same.

Differential Revision: https://reviews.freebsd.org/D24340


# 23feb563 14-Apr-2020 Andrew Gallatin <gallatin@FreeBSD.org>

KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.

While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by: hselasky, jhb, rrs, glebius (early version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24213


# b9555453 21-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make ip6_output() and ip_output() require network epoch.

All callers that before may called into these functions
without network epoch now must enter it.


# 8d5c56da 01-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

In r343631 error code for a packet blocked by a firewall was
changed from EACCES to EPERM. This change was not intentional,
so fix that. Return EACCESS if a firewall forbids sending.

Noticed by: ae


# 0732ac0e 09-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Revert most of the multicast changes from r353292. This needs a more
accurate approach.


# b8a6e03f 07-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Widen NET_EPOCH coverage.

When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111


# 35c7bb34 24-Sep-2019 Randall Stewart <rrs@FreeBSD.org>

This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This
is a completely separate TCP stack (tcp_bbr.ko) that will be built only if
you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option
TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that
supports hardware pacing, BBR understands how to use such a feature.

Note that this commit also adds in a general purpose time-filter which
allows you to have a min-filter or max-filter. A filter allows you to
have a low (or high) value for some period of time and degrade slowly
to another value has time passes. You can find out the details of
BBR by looking at the original paper at:

https://queue.acm.org/detail.cfm?id=3022184

or consult many other web resources you can find on the web
referenced by "BBR congestion control". It should be noted that
BBRv1 (which this is) does tend to unfairness in cases of small
buffered paths, and it will usually get less bandwidth in the case
of large BDP paths(when competing with new-reno or cubic flows). BBR
is still an active research area and we do plan on implementing V2
of BBR to see if it is an improvement over V1.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D21582


# b2e60773 26-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277


# 82334850 28-Jun-2019 John Baldwin <jhb@FreeBSD.org>

Add an external mbuf buffer type that holds multiple unmapped pages.

Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.

Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616


# 05fc9d78 21-Jun-2019 Kristof Provost <kp@FreeBSD.org>

ip_output: pass PFIL_FWD in the slow path

If we take the slow path for forwarding we should still tell our
firewalls (hooked through pfil(9)) that we're forwarding. Pass the
ip_output() flags to ip_output_pfil() so it can set the PFIL_FWD flag
when we're forwarding.

MFC after: 1 week
Sponsored by: Axiado


# 77a01441 11-Jun-2019 John Baldwin <jhb@FreeBSD.org>

Sort opt_foo.h #includes and add a missing blank line in ip_output().


# fb3bc596 24-May-2019 John Baldwin <jhb@FreeBSD.org>

Restructure mbuf send tags to provide stronger guarantees.

- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.

Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.

To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.

For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.

- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.

This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).

In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.

To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.

- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117


# 54bb7ac0 10-May-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Fix regression from r347375: do not panic when sending an IP multicast
packet from an interface that doesn't have IPv4 address.

Reported by: Michael Butler <imb protected-networks.net>


# 6ca363eb 08-May-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Existense of PCB route caching doesn't allow us to use new fast route
lookup KPI in ip_output() like it is already used in ip_forward().
However, when there is no PCB provided we can use fast KPI, gaining
performance advantage.

Typical case when ip_output() is called without a PCB pointer is a
sendto(2) on a not connected UDP socket. In practice DNS servers do
this.

Reviewed by: melifaro
Differential Revision: https://reviews.freebsd.org/D19804


# 50575ce1 25-Apr-2019 Andrew Gallatin <gallatin@FreeBSD.org>

Track TCP connection's NUMA domain in the inpcb

Drivers can now pass up numa domain information via the
mbuf numa domain field. This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.

Follow on changes will use this information for lacp egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.

Reviewed by: markj, kib, slavash, bz, scottl, jtl, tuexen
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20028


# 6c1c6ae5 04-Apr-2019 Rodney W. Grimes <rgrimes@FreeBSD.org>

Use IN_foo() macros from sys/netinet/in.h inplace of handcrafted code

There are a few places that use hand crafted versions of the macros
from sys/netinet/in.h making it difficult to actually alter the
values in use by these macros. Correct that by replacing handcrafted
code with proper macro usage.

Reviewed by: karels, kristof
Approved by: bde (mentor)
MFC after: 3 weeks
Sponsored by: John Gilmore
Differential Revision: https://reviews.freebsd.org/D19317


# b252313f 31-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

New pfil(9) KPI together with newborn pfil API and control utility.

The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.

In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.

New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.

Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.

Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.

Differential Revision: https://reviews.freebsd.org/D18951


# 10731c54 08-Jan-2019 Michael Tuexen <tuexen@FreeBSD.org>

Fix getsockopt() for IP_OPTIONS/IP_RETOPTS.

r336616 copies inp->inp_options using the m_dup() function.
However, this function expects an mbuf packet header at the beginning,
which is not true in this case.
Therefore, use m_copym() instead of m_dup().

This issue was found by syzkaller.
Reviewed by: mmacy@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D18753


# a68cc388 08-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanical cleanup of epoch(9) usage in network stack.

- Remove macros that covertly create epoch_tracker on thread stack. Such
macros a quite unsafe, e.g. will produce a buggy code if same macro is
used in embedded scopes. Explicitly declare epoch_tracker always.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.

This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.

Discussed with: jtl, gallatin


# 3924dfa7 07-Oct-2018 Michael Tuexen <tuexen@FreeBSD.org>

Ensure that the ips_localout counter is incremented for
locally generated SCTP packets sent over IPv4. This make
the behaviour consistent with IPv6.

Reviewed by: ae@, bz@, jtl@
Approved by: re (kib@)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17406


# b6e87011 04-Oct-2018 Tom Jones <thj@FreeBSD.org>

Convert UDP length to host byte order

When getting the number of bytes to checksum make sure to convert the UDP
length to host byte order when the entire header is not in the first mbuf.

Reviewed by: jtl, tuexen, ae
Approved by: re (gjb), jtl (mentor)
Differential Revision: https://reviews.freebsd.org/D17357


# e5e3e746 22-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

Fix a potential use after free in getsockopt() access to inp_options

Discussed with: jhb
Reviewed by: sbruno, transport
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14621


# c8b1bdc3 14-Jul-2018 Sean Bruno <sbruno@FreeBSD.org>

There was quite a bit of feedback on r336282 that has led to the
submitter to want to revert it.


# 179a28b0 14-Jul-2018 Sean Bruno <sbruno@FreeBSD.org>

Fixup memory management for fetching options in ip_ctloutput()

Submitted by: Jason Eggleston <jason@eggnet.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14621


# 1a43cff9 06-Jun-2018 Sean Bruno <sbruno@FreeBSD.org>

Load balance sockets with new SO_REUSEPORT_LB option.

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations:
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).

This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@
for the substantive feedback that is included in this commit.

Submitted by: Johannes Lundberg <johalun0@gmail.com>
Obtained from: DragonflyBSD
Relnotes: Yes
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003


# 590d0a43 06-Jun-2018 Andrey V. Elsukov <ae@FreeBSD.org>

Make in_delayed_cksum() be similar to IPv6 implementation.

Use m_copyback() function to write checksum when it isn't located
in the first mbuf of the chain. Handmade analog doesn't handle the
case when parts of checksum are located in different mbufs.
Also in case when mbuf is too short, m_copyback() will allocate new
mbuf in the chain instead of making out of bounds write.

Also wrap long line and remove now useless KASSERTs.

X-MFC after: r334705


# 1fdbfb90 06-Jun-2018 Tom Jones <thj@FreeBSD.org>

Use UDP len when calculating UDP checksums

The length of the IP payload is normally equal to the UDP length, UDP Options
(draft-ietf-tsvwg-udp-options-02) suggests using the difference between IP
length and UDP length to create space for trailing data.

Correct checksum length calculation to use the UDP length rather than the IP
length when not offloading UDP checksums.

Approved by: jtl (mentor)
Differential Revision: https://reviews.freebsd.org/D15222


# 4f6c66cc 23-May-2018 Matt Macy <mmacy@FreeBSD.org>

UDP: further performance improvements on tx

Cumulative throughput while running 64
netperf -H $DUT -t UDP_STREAM -- -m 1
on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps

Single stream throughput increases from 910kpps to 1.18Mpps

Baseline:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg

- Protect read access to global ifnet list with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg

- Protect short lived ifaddr references with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg

- Convert if_afdata read lock path to epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

A fix for the inpcbhash contention is pending sufficient time
on a canary at LLNW.

Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15409


# 7875017c 24-Apr-2018 Sean Bruno <sbruno@FreeBSD.org>

Revert r332894 at the request of the submitter.

Submitted by: Johannes Lundberg <johalun0_gmail.com>
Sponsored by: Limelight Networks


# 7b7796ee 23-Apr-2018 Sean Bruno <sbruno@FreeBSD.org>

Load balance sockets with new SO_REUSEPORT_LB option

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).

Submitted by: Johannes Lundberg <johanlun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003


# 72bfa0bf 23-Mar-2018 Sean Bruno <sbruno@FreeBSD.org>

Revert r331379 as the "simple" lock changes have revealed a deeper problem
and need for a rethink.

Submitted by: Jason Eggleston <jason@eggnet.com>
Sponsored by: Limelight Networks


# effaab88 23-Mar-2018 Kristof Provost <kp@FreeBSD.org>

netpfil: Introduce PFIL_FWD flag

Forwarded packets passed through PFIL_OUT, which made it difficult for
firewalls to figure out if they were forwarding or producing packets. This in
turn is an issue for pf for IPv6 fragment handling: it needs to call
ip6_output() or ip6_forward() to handle the fragments. Figuring out which was
difficult (and until now, incorrect).
Having pfil distinguish the two removes an ugly piece of code from pf.

Introduce a new variant of the netpfil callbacks with a flags variable, which
has PFIL_FWD set for forwarded packets. This allows pf to reliably work out if
a packet is forwarded.

Reviewed by: ae, kevans
Differential Revision: https://reviews.freebsd.org/D13715


# 2a499acf 22-Mar-2018 Sean Bruno <sbruno@FreeBSD.org>

Simple locking fixes in ip_ctloutput, ip6_ctloutput, rip_ctloutput.

Submitted by: Jason Eggleston <jason@eggnet.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14624


# fc21c53f 22-Jan-2018 Ryan Stone <rstone@FreeBSD.org>

Reduce code duplication for inpcb route caching

Add a new macro to clear both the L3 and L2 route caches, to
hopefully prevent future instances where only the L3 cache was
cleared when both should have been.

MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13989
Reviewed by: karels


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# ae69ad88 27-Jul-2017 Bjoern A. Zeeb <bz@FreeBSD.org>

After inpcb route caching was put back in place there is no need for
flowtable anymore (as flowtable was never considered to be useful in
the forwarding path).

Reviewed by: np
Differential Revision: https://reviews.freebsd.org/D11448


# 8c1960d5 25-Mar-2017 Mike Karels <karels@FreeBSD.org>

Fix reference count leak with L2 caching.

ip_forward, TCP/IPv6, and probably SCTP leaked references to L2 cache
entry because they used their own routes on the stack, not in_pcb routes.
The original model for route caching was callers that provided a route
structure to ip{,6}input() would keep the route, and this model was used
for L2 caching as well. Instead, change L2 caching to be done by default
only when using a route structure in the in_pcb; the pcb deallocation
code frees L2 as well as L3 cacches. A separate change will add route
caching to TCP/IPv6.

Another suggestion was to have the transport protocols indicate willingness
to use L2 caching, but this approach keeps the changes in the network
level

Reviewed by: ae gnn
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D10059


# dce33a45 05-Mar-2017 Ermal Luçi <eri@FreeBSD.org>

The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Reviewed by: adrian, aw
Approved by: ae (mentor)
Sponsored by: rsync.net
Differential Revision: D9235


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# c10c5b1e 11-Feb-2017 Ermal Luçi <eri@FreeBSD.org>

Committed without approval from mentor.

Reported by: gnn


# ed55edce 09-Feb-2017 Ermal Luçi <eri@FreeBSD.org>

The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Sponsored-by: rsync.net
Differential Revision: D9235
Reviewed-by: adrian


# fcf59617 06-Feb-2017 Andrey V. Elsukov <ae@FreeBSD.org>

Merge projects/ipsec into head/.

Small summary
-------------

o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
option IPSEC_SUPPORT added. It enables support for loading
and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
support was removed. Added TCP/UDP checksum handling for
inbound packets that were decapsulated by transport mode SAs.
setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
build as part of ipsec.ko module (or with IPSEC kernel).
It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
methods. The only one header file <netipsec/ipsec_support.h>
should be included to declare all the needed things to work
with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
- now all security associations stored in the single SPI namespace,
and all SAs MUST have unique SPI.
- several hash tables added to speed up lookups in SADB.
- SADB now uses rmlock to protect access, and concurrent threads
can do SA lookups in the same time.
- many PF_KEY message handlers were reworked to reflect changes
in SADB.
- SADB_UPDATE message was extended to support new PF_KEY headers:
SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
avoid locking protection for ipsecrequest. Now we support
only limited number (4) of bundled SAs, but they are supported
for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
check for full history of applied IPsec transforms.
o References counting rules for security policies and security
associations were changed. The proper SA locking added into xform
code.
o xform code was also changed. Now it is possible to unregister xforms.
tdb_xxx structures were changed and renamed to reflect changes in
SADB/SPDB, and changed rules for locking and refcounting.

Reviewed by: gnn, wblock
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9352


# f3e7afe2 18-Jan-2017 Hans Petter Selasky <hselasky@FreeBSD.org>

Implement kernel support for hardware rate limited sockets.

- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months


# 2e77d270 29-Dec-2016 Andrey V. Elsukov <ae@FreeBSD.org>

When we are sending IP fragments, update ip pointers in IP_PROBE() for
each fragment.

MFC after: 1 week


# 6c1bd558 24-Oct-2016 Ryan Stone <rstone@FreeBSD.org>

Fix ip_output() on point-to-point links

In r304435, ip_output() was changed to use the result of the route
lookup to decide whether the outgoing packet was a broadcast or
not. This introduced a regression on interfaces where
IFF_BROADCAST was not set (e.g. point-to-point links), as the
algorithm could incorrectly treat the destination address as a
broadcast address, and ip_output() would subsequently drop the
packet as broadcasting on a non-IFF_BROADCAST interface is not
allowed.

Differential Revision: https://reviews.freebsd.org/D8303
Reviewed by: jtl
Reported by: ambrisko
MFC after: 2 weeks
X-MFC-With: r304435
Sponsored by: Dell EMC Isilon


# cc94f0c2 13-Oct-2016 Gleb Smirnoff <glebius@FreeBSD.org>

- Revert r300854, r303657 which tried to fix regression from r297225.
- Fix the regression proper way using RO_RTFREE().

Submitted by: ae


# 90cc51a1 18-Aug-2016 Ryan Stone <rstone@FreeBSD.org>

Don't iterate over the ifnet addr list in ip_output()

For almost every packet that is transmitted through ip_output(),
a call to in_broadcast() was made to decide if the destination
IP was a broadcast address. in_broadcast() iterates over the
ifnet's address to find a source IP matching the subnet of the
destination IP, and then checks if the IP is a broadcast in that
subnet.

This is completely redundant as we have already performed the
route lookup, so the source IP is already known. Just use that
address to directly check whether the destination IP is a
broadcast address or not.

MFC after: 2 months
Sponsored By: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7266


# 4c105402 08-Jun-2016 Andrey V. Elsukov <ae@FreeBSD.org>

Cleanup unneded include "opt_ipfw.h".

It was used for conditional build IPFIREWALL_FORWARD support.
But IPFIREWALL_FORWARD option was removed a long time ago.


# 6d768226 02-Jun-2016 George V. Neville-Neil <gnn@FreeBSD.org>

This change re-adds L2 caching for TCP and UDP, as originally added in D4306
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.

Submitted by: Mike Karels
Differential Revision: https://reviews.freebsd.org/D6262


# 6351b385 27-May-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Plug route reference underleak that happens with FLOWTABLE after r297225.

Submitted by: Mike Karels <mike karels.net>


# 99d628d5 15-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

netinet: for pointers replace 0 with NULL.

These are mostly cosmetical, no functional change.

Found with devel/coccinelle.

Reviewed by: ae. tuexen


# 84cc0778 24-Mar-2016 George V. Neville-Neil <gnn@FreeBSD.org>

FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.

Submitted by: Mike Karels
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4306


# 36402a68 09-Jan-2016 Alexander V. Chernikov <melifaro@FreeBSD.org>

Finish r275196: do not dereference rtentry in if_output() routines.

The only piece of information that is required is rt_flags subset.

In particular, if_loop() requires RTF_REJECT and RTF_BLACKHOLE flags
to check if this particular mbuf needs to be dropped (and what
error should be returned).
Note that if_loop() will always return EHOSTUNREACH for "reject" routes
regardless of RTF_HOST flag existence. This is due to upcoming routing
changes where RTF_HOST value won't be available as lookup result.

All other functions require RTF_GATEWAY flag to check if they need
to return EHOSTUNREACH instead of EHOSTDOWN error.

There are 11 places where non-zero 'struct route' is passed to if_output().
For most of the callers (forwarding, bpf, arp) does not care about exact
error value. In fact, the only place where this result is propagated
is ip_output(). (ip6_output() passes NULL route to nd6_output_ifp()).

Given that, add 3 new 'struct route' flags (RT_REJECT, RT_BLACKHOLE and
RT_IS_GW) and inline function (rt_update_ro_flags()) to copy necessary
rte flags to ro_flags. Call this function in ip_output() after looking up/
verifying rte.

Reviewed by: ae


# 4fb3a820 30-Dec-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Implement interface link header precomputation API.

Add if_requestencap() interface method which is capable of calculating
various link headers for given interface. Right now there is support
for INET/INET6/ARP llheader calculation (IFENCAP_LL type request).
Other types are planned to support more complex calculation
(L2 multipath lagg nexthops, tunnel encap nexthops, etc..).

Reshape 'struct route' to be able to pass additional data (with is length)
to prepend to mbuf.

These two changes permits routing code to pass pre-calculated nexthop data
(like L2 header for route w/gateway) down to the stack eliminating the
need for other lookups. It also brings us closer to more complex scenarios
like transparently handling MPLS nexthops and tunnel interfaces.
Last, but not least, it removes layering violation introduced by flowtable
code (ro_lle) and simplifies handling of existing if_output consumers.

ARP/ND changes:
Make arp/ndp stack pre-calculate link header upon installing/updating lle
record. Interface link address change are handled by re-calculating
headers for all lles based on if_lladdr event. After these changes,
arpresolve()/nd6_resolve() returns full pre-calculated header for
supported interfaces thus simplifying if_output().
Move these lookups to separate ether_resolve_addr() function which ether
returs error or fully-prepared link header. Add <arp|nd6_>resolve_addr()
compat versions to return link addresses instead of pre-calculated data.

BPF changes:
Raw bpf writes occupied _two_ cases: AF_UNSPEC and pseudo_AF_HDRCMPLT.
Despite the naming, both of there have ther header "complete". The only
difference is that interface source mac has to be filled by OS for
AF_UNSPEC (controlled via BIOCGHDRCMPLT). This logic has to stay inside
BPF and not pollute if_output() routines. Convert BPF to pass prepend data
via new 'struct route' mechanism. Note that it does not change
non-optimized if_output(): ro_prepend handling is purely optional.
Side note: hackish pseudo_AF_HDRCMPLT is supported for ethernet and FDDI.
It is not needed for ethernet anymore. The only remaining FDDI user is
dev/pdq mostly untouched since 2007. FDDI support was eliminated from
OpenBSD in 2013 (sys/net/if_fddisubr.c rev 1.65).

Flowtable changes:
Flowtable violates layering by saving (and not correctly managing)
rtes/lles. Instead of passing lle pointer, pass pointer to pre-calculated
header data from that lle.

Differential Revision: https://reviews.freebsd.org/D4102


# 331dff07 08-Aug-2015 Alexander V. Chernikov <melifaro@FreeBSD.org>

Simplify ip[6] simploop:
Do not pass 'dst' sockaddr to ip[6]_mloopback:
- We have explicit check for AF_INET in ip_output()
- We assume ip header inside passed mbuf in ip_mloopback
- We assume ip6 header inside passed mbuf in ip6_mloopback


# 8f980c01 03-Aug-2015 Mark Johnston <markj@FreeBSD.org>

The mbuf parameter to ip_output_pfil() must be an output parameter since
pfil(9) hooks may modify the chain.

X-MFC-With: r286028


# 3c402323 29-Jul-2015 Ermal Luçi <eri@FreeBSD.org>

Avoid double reference decrement when firewalls force relooping of packets

When firewalls force a reloop of packets and the caller supplied a route the reference to the route might be reduced twice creating issues.
This is especially the scenario when a packet is looped because of operation in the firewall but the new route lookup gives a down route.

Differential Revision: https://reviews.freebsd.org/D3037
Reviewed by: gnn
Approved by: gnn(mentor)


# d9f2a782 29-Jul-2015 Ermal Luçi <eri@FreeBSD.org>

ip_output normalization and fixes

ip_output has a big chunk of code used to handle special cases with pfil consumers which also forces a reloop on it.
Gather all this code together to make it readable and properly handle the reloop cases.

Some of the issues identified:

M_IP_NEXTHOP is not handled properly in existing code.
route reference leaking is possible with in FIB number change
route flags checking is not consistent in the function

Differential Revision: https://reviews.freebsd.org/D3022
Reviewed by: gnn
Approved by: gnn(mentor)
MFC after: 4 weeks


# cc0a3c8c 29-Jul-2015 Andrey V. Elsukov <ae@FreeBSD.org>

Convert in_ifaddr_lock and in6_ifaddr_lock to rmlock.

Both are used to protect access to IP addresses lists and they can be
acquired for reading several times per packet. To reduce lock contention
it is better to use rmlock here.

Reviewed by: gnn (previous version)
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D3149


# c4c4346f 02-Apr-2015 Hans Petter Selasky <hselasky@FreeBSD.org>

Extend fixes made in r278103 and r38754 by copying the complete packet
header and not only partial flags and fields. Firewalls can attach
classification tags to the outgoing mbufs which should be copied to
all the new fragments. Else only the first fragment will be let
through by the firewall. This can easily be tested by sending a large
ping packet through a firewall. It was also discovered that VLAN
related flags and fields should be copied for packets traversing
through VLANs. This is all handled by "m_dup_pkthdr()".

Regarding the MAC policy check in ip_fragment(), the tag provided by
the originating mbuf is copied instead of using the default one
provided by m_gethdr().

Tested by: Karim Fodil-Lemelin <fodillemlinkarim at gmail.com>
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
PR: 7802


# 6d947416 01-Apr-2015 Gleb Smirnoff <glebius@FreeBSD.org>

o Use new function ip_fillid() in all places throughout the kernel,
where we want to create a new IP datagram.
o Add support for RFC6864, which allows to set IP ID for atomic IP
datagrams to any value, to improve performance. The behaviour is
controlled by net.inet.ip.rfc6864 sysctl knob, which is enabled by
default.
o In case if we generate IP ID, use counter(9) to improve performance.
o Gather all code related to IP ID into ip_id.c.

Differential Revision: https://reviews.freebsd.org/D2177
Reviewed by: adrian, cy, rpaulo
Tested by: Emeric POUPON <emeric.poupon stormshield.eu>
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Relnotes: yes


# d612b95e 27-Mar-2015 Fabien Thomas <fabient@FreeBSD.org>

On multi CPU systems, we may emit successive packets with the same id.
Fix the race by using an atomic operation.

Differential Revision: https://reviews.freebsd.org/D2141
Obtained from: emeric.poupon@stormshield.eu
MFC after: 1 week
Sponsored by: Stormshield


# 9c0f6aa7 25-Feb-2015 Hans Petter Selasky <hselasky@FreeBSD.org>

Fix a special case in ip_fragment() to produce a more sensible chain
of packets. When the data payload length excluding any headers, of an
outgoing IPv4 packet exceeds PAGE_SIZE bytes, a special case in
ip_fragment() can kick in to optimise the outgoing payload(s). The
code which was added in r98849 as part of zero copy socket support
assumes that the beginning of any MTU sized payload is aligned to
where a MBUF's "m_data" pointer points. This is not always the case
and can sometimes cause large IPv4 packets, as part of ping replies,
to be split more than needed.

Instead of iterating the MBUFs to figure out how much data is in the
current chain, use the value already in the "m_pkthdr.len" field of
the first MBUF in the chain.

Reviewed by: ken @
Differential Revision: https://reviews.freebsd.org/D1893
MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# 609752f0 02-Feb-2015 Hans Petter Selasky <hselasky@FreeBSD.org>

The flowid and hashtype should be copied from the originating packet
when fragmenting IP packets to preserve the order of the packets in a
stream. Else the resulting fragments can be sent out of order when the
hardware supports multiple transmit rings.

Reviewed by: glebius @
MFC after: 1 week
Sponsored by: Mellanox Technologies


# b2bdc62a 18-Jan-2015 Adrian Chadd <adrian@FreeBSD.org>

Refactor / restructure the RSS code into generic, IPv4 and IPv6 specific
bits.

The motivation here is to eventually teach netisr and potentially
other networking subsystems a bit more about how RSS work queues / buckets
are configured so things have a hope of auto-configuring in the future.

* net/rss_config.[ch] takes care of the generic bits for doing
configuration, hash function selection, etc;
* topelitz.[ch] is now in net/ rather than netinet/;
* (and would be in libkern if it didn't directly include RSS_KEYSIZE;
that's a later thing to fix up.)
* netinet/in_rss.[ch] now just contains the IPv4 specific methods;
* and netinet/in6_rss.[ch] now just contains the IPv6 specific methods.

This should have no functional impact on anyone currently using
the RSS support.

Differential Revision: D1383
Reviewed by: gnn, jfv (intel driver bits)


# 0275b2e3 11-Dec-2014 Andrey V. Elsukov <ae@FreeBSD.org>

Remove flag/flags argument from the following functions:
ipsec_getpolicybyaddr()
ipsec4_checkpolicy()
ip_ipsec_output()
ip6_ipsec_output()

The only flag used here was IP_FORWARDING.

Obtained from: Yandex LLC
Sponsored by: Yandex LLC


# c2529042 01-Dec-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after: 1 month
Sponsored by: Mellanox Technologies


# 7f948f12 16-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Finish r274175: do control plane MTU tracking.

Update route MTU in case of ifnet MTU change.
Add new RTF_FIXEDMTU to track explicitly specified MTU.

Old behavior:
ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU.
User has to manually update all routes.
ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU.
However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu
gets updated.

New behavior:
ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs gets updated
with new MTU unless RTF_FIXEDMTU flag set on them.
ifconfig em0 mtu 9000->1500 -> all routes in all fibs gets updated with new
MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu.

route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag.
route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag.

PR: 194238
MFC after: 1 month
CR: D1125


# 603eaf79 09-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Renove faith(4) and faithd(8) from base. It looks like industry
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.

No objections from: net@


# 257480b8 04-Nov-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Convert netinet6/ to use new routing API.

* Remove &ifpp from ip6_output() in favor of ri->ri_nh_info
* Provide different wrappers to in6_selectsrc:
Currently it is used by 2 differenct type of customers:
- socket-based one, which all are unsure about provided
address scope and
- in-kernel ones (ND code mostly), which don't have
any sockets, options, crededentials, etc.
So, we provide two different wrappers to in6_selectsrc()
returning select source.
* Make different versions of selectroute():
Currenly selectroute() is used in two scenarios:
- SAS, via in6_selecsrc() -> in6_selectif() -> selectroute()
- output, via in6_output -> wrapper -> selectroute()
Provide different versions for each customer:
- fib6_lookup_nh_basic()-based in6_selectif() which is
capable of returning interface only, without MTU/NHOP/L2
calculations
- full-blown fib6_selectroute() with cached route/multipath/
MTU/L2
* Stop using routing table for link-local address lookups
* Add in6_ifawithifp_lla() to make for-us check faster for link-local
* Add in6_splitscope / in6_setllascope for faster embed/deembed scopes


# 9f65116c 25-Oct-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

* Increase nh_flags to be u16 thus reducing nhop payload to be 48 bytes
* Use NHF_ namespace for all nhop flags
* Rename nhop_data -> nhop_prepend
* Rename fib4_lookup_nh_extended -> fib4_lookup_nh_ext
* Add "flags" argument to fib4_lookup_nh_ext() to specify whether we want
returned nh_ext structure to be refcounted or not.


# 2bb83c79 23-Oct-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Rename ip_sendmbuf to fib4_sendmbuf() and move it to rt_nhops api.
Convert IPv4 SAS to use new routing api.


# b4e8f808 19-Oct-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Switch IPv4 output path to use new routing api.

The goals of the new API is to provide consumers with minimal
needed information, but as fast as possible. So we provide
full nexthop info copied into alighed on-cache structure
instead of rte/ia pointers, their refcounts and locks.
This does not provide solution for protecting from egress
ifp destruction, but does not make it any worse.

Current changes:

nhops:
Add fib4_lookup_prepend() function which stores either full
L2+L3 prepend info (e.g. MAC header in case of plain IPv4) or
L3 info with NH_FLAGS_L2_INCOMPLETE flag indicating that no valid L2
info exists and we have to take "slow" path.

ip_output:
Currently ip[ 46]_output consumers use 'struct route' for
the following purposes:
1) double lookup avoidance(route caching)
2) plain route caching
3) get path MTU to be able to notify source.
The former pattern is mostly used by various tunnels
(gif, gre, stf). (Actually, gre is the only remaining,
others were already converted. Their locking model did
not scale good enogh to benefit from such caching, so
we have (temporarily) removed it without any performance
loss).
Plain route caching used by SCTP is simply wrong and should be removed.
Temporary break it for now just to be able to compile.
Optimize path mtu reporting by providing it in new 'route_info' stucture.

Minimize games with @ia locking/refcounting for route lookup:
add special nhop[46]_extended structure to store more route attributes.
Pointer to given structure can be passed to fib4_lookup_prepend() to indicate
we want this info (we actually needs it for UDP and raw IP).

ether_output:
Provide light-weight ether_output2() call to deal with
transmitting L2 frame (e.g. properly handle broadcast/simloop/bridge/
other L2 hooks before actually transmitting frame by if_transmit()).
Add a hack based on new RT_NHOP ro_flag to distinguish which version should
we call. Better way is probably to add a new "if_output_frame" driver
callbacks.

Next steps:
* Convert ip_fastfwd part
* Implement auto-growing array for per-radix nexthops
* Implement LLE tracking for nexthop calculations to be able to
immediately provide all necessary info in single route lookup
for gateway routes
* Switch radix locking scheme to runtime/cfg lock
* Implement multipath support for rtsock
* Implement "tracked nexthops" for tunnels (e.g. _proper_
nexthop caching)
* Add IPv6 support for remaining parts (postponed not to
interfere with user/ae/inet6 branch)
* Consider adding "if_output_frame" driver call to
ease logical frame pushing.


# f0cace5d 12-Oct-2014 Robert Watson <rwatson@FreeBSD.org>

When deciding whether to call m_pullup() even though there is adequate
data in an mbuf, use M_WRITABLE() instead of a direct test of M_EXT;
the latter both unnecessarily exposes mbuf-allocator internals in the
protocol stack and is also insufficient to catch all cases of
non-writability.

(NB: m_pullup() does not actually guarantee that a writable mbuf is
returned, so further refinement of all of these code paths continues to
be required.)

Reviewed by: bz
MFC after: 3 days
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D900


# 9c57a5b6 01-Oct-2014 Hiroki Sato <hrs@FreeBSD.org>

Add an additional routing table lookup when m->m_pkthdr.fibnum is changed
at a PFIL hook in ip{,6}_output(). IPFW setfib rule did not perform
a routing table lookup when the destination address was not changed.

CR: D805


# 22bfa4f5 18-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove disabled code, that is very unlikely to be ever enabled again,
as well as the comment that explains why is it disabled.


# 58a39d8c 16-Sep-2014 Alan Somers <asomers@FreeBSD.org>

Fix source address selection on unbound sockets in the presence of multiple
fibs. Use the mbuf's or the socket's fib instead of RT_ALL_FIBS. Fixes PR
187553. Also fixes netperf's UDP_STREAM test on a nondefault fib.

sys/netinet/ip_output.c
In ip_output, lookup the source address using the mbuf's fib instead
of RT_ALL_FIBS.

sys/netinet/in_pcb.c
in in_pcbladdr, lookup the source address using the socket's fib,
because we don't seem to have the mbuf fib. They should be the same,
though.

tests/sys/net/fibs_test.sh
Clear the expected failure on udp_dontroute.

PR: 187553
CR: https://reviews.freebsd.org/D772
MFC after: 3 weeks
Sponsored by: Spectra Logic


# 4f8585e0 11-Sep-2014 Alan Somers <asomers@FreeBSD.org>

Revisions 264905 and 266860 added a "int fib" argument to ifa_ifwithnet and
ifa_ifwithdstaddr. For the sake of backwards compatibility, the new
arguments were added to new functions named ifa_ifwithnet_fib and
ifa_ifwithdstaddr_fib, while the old functions became wrappers around the
new ones that passed RT_ALL_FIBS for the fib argument. However, the
backwards compatibility is not desired for FreeBSD 11, because there are
numerous other incompatible changes to the ifnet(9) API. We therefore
decided to remove it from head but leave it in place for stable/9 and
stable/10. In addition, this commit adds the fib argument to
ifa_ifwithbroadaddr for consistency's sake.

sys/sys/param.h
Increment __FreeBSD_version

sys/net/if.c
sys/net/if_var.h
sys/net/route.c
Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib
versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute.

sys/net/route.c
sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_options.c
sys/netinet/ip_output.c
sys/netinet6/nd6.c
Fixup calls of modified functions.

share/man/man9/ifnet.9
Document changed API.

CR: https://reviews.freebsd.org/D458
MFC after: Never
Sponsored by: Spectra Logic


# 9d3ddf43 08-Sep-2014 Adrian Chadd <adrian@FreeBSD.org>

Add support for receiving and setting flowtype, flowid and RSS bucket
information as part of recvmsg().

This is primarily used for debugging/verification of the various
processing paths in the IP, PCB and driver layers.

Unfortunately the current implementation of the control message path
results in a ~10% or so drop in UDP frame throughput when it's used.

Differential Revision: https://reviews.freebsd.org/D527
Reviewed by: grehan


# 061a4b4c 08-Sep-2014 Adrian Chadd <adrian@FreeBSD.org>

Add a flag to ip_output() - IP_NODEFAULTFLOWID - which prevents it from
overriding an existing flowid/flowtype field in the outbound mbuf with
the inp_flowid/inp_flowtype details.

The upcoming RSS UDP support calculates a valid RSS value for outbound
mbufs and since it may change per send, it doesn't cache it in the inpcb.
So overriding it here would be wrong.

Differential Revision: https://reviews.freebsd.org/D527
Reviewed by: grehan


# bf7dcda3 03-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Clean up unused CSUM_FRAGMENT.

Sponsored by: Nginx, Inc.


# 0a100a6f 09-Jul-2014 Adrian Chadd <adrian@FreeBSD.org>

Implement the first stage of multi-bind listen sockets and RSS socket
awareness.

* Introduce IP_BINDMULTI - indicating that it's okay to bind multiple
sockets on the same bind details.

Although the PCB code has been taught about this (see below) this patch
doesn't introduce the rest of the PCB changes necessary to distribute
lookups among multiple PCB entries in the global wildcard table.

* Introduce IP_RSS_LISTEN_BUCKET - placing an listen socket into the
given RSS bucket (and thus a single PCBGROUP hash.)

* Modify the PCB add path to be aware of IP_BINDMULTI:
+ Only allow further PCB entries to be added if the owner credentials
and IP_BINDMULTI has been specified. Ie, only allow further
IP_BINDMULTI sockets to appear if the first bind() was IP_BINDMULTI.

* Teach the PCBGROUP code about IP_RSS_LISTE_BUCKET marked PCB entries.
Instead of using the wildcard logic and hashing, these sockets are
simply placed into the PCBGROUP and _not_ in the wildcard hash.

* When doing a PCBGROUP lookup, also do a wildcard match as well.
This allows for an RSS bucket PCB entry to appear in a PCBGROUP
rather than having to exist in the wildcard list.

Tested:

* TCP IPv4 server testing with igb(4)
* TCP IPv4 server testing with ix(4)

TODO:

* The pcbgroup lookup code duplicated the wildcard and wildcard-PCB
logic. This could be refactored into a single function.

* This doesn't yet work for IPv6 (The PCBGROUP code in netinet6/ doesn't
yet know about this); nor does it yet fully work for UDP.


# fe82cbe8 09-Jul-2014 Gleb Smirnoff <glebius@FreeBSD.org>

In several cases in ip_output() we obtain reference on ifa. Do not
leak it.

Together with: asomers, np
Sponsored by: Nginx, Inc.


# 81a99d38 01-Jul-2014 Adrian Chadd <adrian@FreeBSD.org>

Remove old reference to IP_RSSCPUID.

Submitted by: Eggert, Lars <lars@netapp.com>


# dc847eb6 27-Jun-2014 Adrian Chadd <adrian@FreeBSD.org>

Add missing variable declarations when using RSS.

Reported by: bryanv@


# 7847796a 25-Jun-2014 Adrian Chadd <adrian@FreeBSD.org>

Retire IP_RSSCPUID ; the right thing to do is query the RSS bucket;
map the bucket to an RSS queue, then map the queue to a CPU ID.
This way the bucket->queue and queue->CPU mapping can change
over time.

Introduce IP_RSSBUCKETID - which instead looks up the RSS bucket.
User applications can then map the RSS bucket to a CPU.


# 2f308a34 29-May-2014 Alan Somers <asomers@FreeBSD.org>

Fix unintended KBI change from r264905. Add _fib versions of
ifa_ifwithnet() and ifa_ifwithdstaddr() The legacy functions will call the
_fib() versions with RT_ALL_FIBS, preserving legacy behavior.

sys/net/if_var.h
sys/net/if.c
Add legacy-compatible functions as described above. Ensure legacy
behavior when RT_ALL_FIBS is passed as fibnum.

sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/net/route.c
sys/net/rtsock.c
sys/netinet6/nd6.c
Call with _fib() functions if we must use a specific fib, or the
legacy functions otherwise.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
Improve the udp_dontroute test. The bug that this test exercises is
that ifa_ifwithnet() will return the wrong address, if multiple
interfaces have addresses on the same subnet but with different
fibs. The previous version of the test only considered one possible
failure mode: that ifa_ifwithnet_fib() might fail to find any
suitable address at all. The new version also checks whether
ifa_ifwithnet_fib() finds the correct address by checking where the
ARP request goes.

Reported by: bz, hrs
Reviewed by: hrs
MFC after: 1 week
X-MFC-with: 264905
Sponsored by: Spectra Logic


# 9c423972 18-May-2014 Adrian Chadd <adrian@FreeBSD.org>

* When copying the flowid from inp -> outbound mbuf, also assign the
hashtype to to the outbound mbuf as well as the flowid.

* Add in socket options to fetch the hashid, the hashtype and RSS CPU
ID for a given socket.


# 26461454 08-May-2014 Michael Tuexen <tuexen@FreeBSD.org>

Use KASSERTs as suggested by glebius@

MFC after: 3 days
X-MFC with: 265691


# 8e1d0a56 08-May-2014 Michael Tuexen <tuexen@FreeBSD.org>

For some UDP packets (for example with 200 byte payload) and IP options,
the IP header and the UDP header are not in the same mbuf.
Add code to in_delayed_cksum() to deal with this case.

MFC after: 3 days


# 0cfee0c2 24-Apr-2014 Alan Somers <asomers@FreeBSD.org>

Fix subnet and default routes on different FIBs on the same subnet.

These two bugs are closely related. The root cause is that ifa_ifwithnet
does not consider FIBs when searching for an interface address.

sys/net/if_var.h
sys/net/if.c
Add a fib argument to ifa_ifwithnet and ifa_ifwithdstadddr. Those
functions will only return an address whose interface fib equals the
argument.

sys/net/route.c
Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib
arguments.

sys/netinet/in.c
Update in_addprefix to consider the interface fib when adding
prefixes. This will prevent it from not adding a subnet route when
one already exists on a different fib.

sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/netinet6/nd6.c
Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet.
In some cases it there wasn't a clear specific fib number to use.
In others, I was unable to test those functions so I chose
RT_DEFAULT_FIB to minimize divergence from current behavior. I will
fix some of the latter changes along with PR kern/187553.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
tests/sys/netinet/Makefile
Revert r263738. The udp_dontroute test was right all along.
However, bugs kern/187550 and kern/187553 cancelled each other out
when it came to this test. Because of kern/187553, ifa_ifwithnet
searched the default fib instead of the requested one, but because
of kern/187550, there was an applicable subnet route on the default
fib. The new test added in r263738 doesn't work right, however. I
can verify with dtrace that ifa_ifwithnet returned the wrong address
before I applied this commit, but route(8) miraculously found the
correct interface to use anyway. I don't know how.

Clear expected failure messages for kern/187550 and kern/187552.

PR: kern/187550
PR: kern/187552
Reviewed by: melifaro
MFC after: 3 weeks
Sponsored by: Spectra Logic


# e3a7aa6f 04-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with: melifaro
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 2a7da729 04-Mar-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove ifa_ref()/ifa_free(), which are atomic(9), from ip_output().

The ifaddr is already referenced by the rtentry, and we are holding
reference on the rtentry throughout the function execution.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 0ff96b4f 17-Feb-2014 Gleb Smirnoff <glebius@FreeBSD.org>

o Remove at compile time the HASH_ALL code, that was never
tested and is unfinished. However, I've tested my version,
it works okay. As before it is unfinished: timeout aren't
driven by TCP session state. To enable the HASH_ALL mode,
one needs in kernel config:

options FLOWTABLE_HASH_ALL

o Reduce the alignment on flentry to 64 bytes. Without
the FLOWTABLE_HASH_ALL option, twice less memory would
be consumed by flows.
o API to ip_output()/ip6_output() got even more thin: 1 liner.
o Remove unused unions. Simply use fle->f_key[].
o Merge all IPv4 code into flowtable_lookup_ipv4(), and do same
flowtable_lookup_ipv6(). Stop copying data to on stack
sockaddr structures, simply use key[] on stack.
o Move code from flowtable_lookup_common() that actually works
on insertion into flowtable_insert().

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 5d6d7e75 07-Feb-2014 Gleb Smirnoff <glebius@FreeBSD.org>

o Revamp API between flowtable and netinet, netinet6.
- ip_output() and ip_output6() simply call flowtable_lookup(),
passing mbuf and address family. That's the only code under
#ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
- Remove hand made pcpu stats, and utilize counter(9).
- Snapshot of statistics is available via 'netstat -rs'.
- All sysctls are moved into net.flowtable namespace, since
spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
- Remove chain of multiple flowtables. We simply have one for
IPv4 and one for IPv6.
- Flowtables are allocated in flowtable.c, symbols are static.
- With proper argument to SYSINIT() we no longer need flowtable_ready.
- Hash salt doesn't need to be per-VNET.
- Removed rudimentary debugging, which use quite useless in dtrace era.

The runtime behavior of flowtable shouldn't be changed by this commit.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 3c065f2f 15-Jan-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Cleanup comments and whitespace. No functional changes.


# 054692a4 16-Jan-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix ipfw fwd for IPv4 traffic broken by r249894.

Problem case:
Original lookup returns route with GW set, so gw points to
rte->rt_gateway.
After that we're changing dst and performing lookup another time.
Since fwd host is most probably directly reachable, resulting
rte does not contain rt_gateway, so gw is not set. Finally, we
end with packet transmitted to proper interface but wrong
link-layer address.

Found by: lstewart
Discussed with: ae,lstewart
MFC after: 2 weeks
Sponsored by: Yandex LLC


# 183e1c86 02-Jan-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Fix regression from r249894. Now we pass "gw" as argument to if_output
method, thus for multicast case we need it to point at "dst".

PR: 185395
Submitted by: ae


# ac7e1212 20-Dec-2013 Adrian Chadd <adrian@FreeBSD.org>

Disable the now unpredicably bogus check for whether we have
eneough queue space before queuing a bunch of IP fragments.

As the comment in the committed change says, in the post-if_transmit(),
post-SMP, post-preemption world, there's just too much overlapping
concurrent code paths and different approaches to driver transmit
queue management to have this code even remotely be effective.

The only specific place it could be useful is if ALTQ is enabled
but again it doesn't at all promise that all the fragments will be
transmitted anyway.

The main reason for committing this change is to disable a parallel
place where the drops counter is incremented. This is a side effect
of an upcoming change to ixgbe/cxgbe to handle the queue drops
counter slightly better.

Sponsored by: Netflix, Inc.


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# 76039bc8 26-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 7caf4ab7 15-Oct-2013 Gleb Smirnoff <glebius@FreeBSD.org>

- Utilize counter(9) to accumulate statistics on interface addresses. Add
four counters to struct ifaddr. This kills '+=' on a variables shared
between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
took if_data not from the ifaddr, but from ifaddr's ifnet. [2]

Submitted by: melifaro [1], pluknet[2]
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 57f60867 25-Aug-2013 Mark Johnston <markj@FreeBSD.org>

Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by: gnn, hiren
MFC after: 1 month


# 86bd0491 19-Aug-2013 Andre Oppermann <andre@FreeBSD.org>

Add m_clrprotoflags() to clear protocol specific mbuf flags at up and
downwards layer crossings.

Consistently use it within IP, IPv6 and ethernet protocols.

Discussed with: trociny, glebius


# fb86dfcd 19-Aug-2013 Andre Oppermann <andre@FreeBSD.org>

Remove unused M_FRAG, M_FIRSTFRAG and M_LASTFRAG tagging from ip_fragment().
There wasn't any real driver (and hardware) support for it. Modern hardware
does full fragmentation/segmentation offload instead.


# efdf104b 04-Jul-2013 Mikolaj Golub <trociny@FreeBSD.org>

In r227207, to fix the issue with possible NULL inp_socket pointer
dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR
for multicast), INP_REUSEPORT flag was introduced to cache the socket
option. It was decided then that one flag would be enough to cache
both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR
setsockopt(2), it was checked if it was called for a multicast address
and INP_REUSEPORT was set accordingly.

Unfortunately that approach does not work when setsockopt(2) is called
before binding to a multicast address: the multicast check fails and
INP_REUSEPORT is not set.

Fix this by adding INP_REUSEADDR flag to unconditionally cache
SO_REUSEADDR.

PR: 179901
Submitted by: Michael Gmelin freebsd grem.de (initial version)
Reviewed by: rwatson
MFC after: 1 week


# 47e8d432 25-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Add const qualifier to the dst parameter of the ifnet if_output method.


# 4c7a6059 24-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Introduce a pointer to const variable gw, which points either at the
same place as dst, or to the sockaddr in the routing table.

The const constraint of gw makes us safe from modifing routing table
accidentially. And "onstantness" of dst allows us to remove several
bandaids, when we switched it back at &ro->ro_dst, now it always
points there.

Reviewed by: rrs


# 0be23a54 24-Apr-2013 Randall Stewart <rrs@FreeBSD.org>

This fixes the issue with the "randomly changing" default
route. What it was is there are two places in ip_output.c
where we do a goto again. One place was fine, it
copies out the new address and then resets dst = ro->rt_dst;
But the other place does *not* do that, which means earlier
when we found the gateway, we have dst pointing there
aka dst = ro->rt_gateway is done.. then we do a
goto again.. bam now we clobber the default route.

The fix is just to move the again so we are always
doing dst = &ro->rt_dst; in the again loop.

PR: 174749,157796
MFC after: 1 week


# dc4ad05e 14-Mar-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Use m_get/m_gethdr instead of compat macros.

Sponsored by: Nginx, Inc.


# eb1b1807 05-Dec-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# da2299c5 27-Nov-2012 Andre Oppermann <andre@FreeBSD.org>

Remove unused and unnecessary CSUM_IP_FRAGS checksumming capability.
Checksumming the IP header of fragments is no different from doing
normal IP headers.

Discussed with: yongari
MFC after: 1 week


# ffdbf9da 01-Nov-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by: andre


# 078468ed 26-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

o Remove last argument to ip_fragment(), and obtain all needed information
on checksums directly from mbuf flags. This simplifies code.
o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in
hardware. Some driver may not announce CSUM_IP in theur if_hwassist,
although try to do checksums if CSUM_IP set on mbuf. Example is em(4).
o While here, consistently use CSUM_IP instead of its alias CSUM_DELAY_IP.
After this change CSUM_DELAY_IP vanishes from the stack.

Submitted by: Sebastian Kuzminsky <seb lineratesystems.com>


# c1de64a4 25-Oct-2012 Andrey V. Elsukov <ae@FreeBSD.org>

Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks


# 8f134647 22-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>


# 347d90ac 14-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Fix a miss from r241344: in ip_mloopback() we need to go to
net byte order prior to calling in_delayed_cksum().

Reported by: Olivier Cochard-Labbe <olivier cochard.me>


# 23e9c6dc 08-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

After r241245 it appeared that in_delayed_cksum(), which still expects
host byte order, was sometimes called with net byte order. Since we are
moving towards net byte order throughout the stack, the function was
converted to expect net byte order, and its consumers fixed appropriately:
- ip_output(), ipfilter(4) not changed, since already call
in_delayed_cksum() with header in net byte order.
- divert(4), ng_nat(4), ipfw_nat(4) now don't need to swap byte order
there and back.
- mrouting code and IPv6 ipsec now need to switch byte order there and
back, but I hope, this is temporary solution.
- In ipsec(4) shifted switch to net byte order prior to in_delayed_cksum().
- pf_route() catches up on r241245 changes to ip_output().


# 21d172a3 06-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

A step in resolving mess with byte ordering for AF_INET. After this change:

- All packets in NETISR_IP queue are in net byte order.
- ip_input() is entered in net byte order and converts packet
to host byte order right _after_ processing pfil(9) hooks.
- ip_output() is entered in host byte order and converts packet
to net byte order right _before_ processing pfil(9) hooks.
- ip_fragment() accepts and emits packet in net byte order.
- ip_forward(), ip_mloopback() use host byte order (untouched actually).
- ip_fastforward() no longer modifies packet at all (except ip_ttl).
- Swapping of byte order there and back removed from the following modules:
pf(4), ipfw(4), enc(4), if_bridge(4).
- Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
- __FreeBSD_version bumped.
- pfil(9) manual page updated.

Reviewed by: ray, luigi, eri, melifaro
Tested by: glebius (LE), ray (BE)


# 3c73180f 18-Jul-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Plug a reference leak: before doing 'goto again' we need to unref
ia->ia_ifa if there is any.

Submitted by: Andrey Zonov <andrey zonov.org>


# bf984051 04-Jul-2012 Gleb Smirnoff <glebius@FreeBSD.org>

When ip_output()/ip6_output() is supplied a struct route *ro argument,
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.

The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.

- Introduce new flag for struct route/route_in6, that marks route
not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
but empty route should use RO_RTFREE() to free results of
lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
ro->ro_rt == NULL.

Tested by: tuexen (SCTP part)


# 3cca425b 12-Jun-2012 Michael Tuexen <tuexen@FreeBSD.org>

Add a IP_RECVTOS socket option to receive for received UDP/IPv4
packets a cmsg of type IP_RECVTOS which contains the TOS byte.
Much like IP_RECVTTL does for TTL. This allows to implement a
protocol on top of UDP and implementing ECN.

MFC after: 3 days


# fc06cd42 06-Nov-2011 Mikolaj Golub <trociny@FreeBSD.org>

Cache SO_REUSEPORT socket option in inpcb-layer in order to avoid
inp_socket->so_options dereference when we may not acquire the lock on
the inpcb.

This fixes the crash due to NULL pointer dereference in
in_pcbbind_setup() when inp_socket->so_options in a pcb returned by
in_pcblookup_local() was checked.

Reported by: dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com>
Suggested by: rwatson
Glanced by: rwatson
Tested by: dave jones <s.dave.jones@gmail.com>


# 05b9d121 14-Apr-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

The mbuf_frag_size always was and is file local and not queried from base
user space tools via kvm. Mark it static.

MFC after: 3 days


# c744cde4 31-Dec-2010 Bjoern A. Zeeb <bz@FreeBSD.org>

Try to catch a possible divide-by-zero as early as possible if "mtu" is 0
(also test for negative MTUs if checking it anyway).
An MTU of 0 is arguably a bug elsewhere, but this at least gives us some
more debugging hints.

Sponsored by: ISPsystem (Early 2010)
MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 5f6bf451 24-Sep-2010 Attilio Rao <attilio@FreeBSD.org>

IP_BINDANY is not correctly handled in getsockopt() case.
Fix it by specifying the correct bits.

Sponsored by: Sandvine Incorporated
Reviewed by: bz, emaste, rstone
Obtained from: Sandvine Incorporated
MFC after: 10 days


# dd62f5c0 25-Jun-2010 Qing Li <qingli@FreeBSD.org>

MFC r208553

This patch fixes the problem where proxy ARP entries cannot be added
over the if_ng interface.

Approved by: re (bz)


# 0ed6142b 25-May-2010 Qing Li <qingli@FreeBSD.org>

This patch fixes the problem where proxy ARP entries cannot be added
over the if_ng interface.

MFC after: 3 days


# 54bb4167 05-Apr-2010 Randall Stewart <rrs@FreeBSD.org>

MFC of 2 items to fix the csum for v6 issue:
Revision 205075 and 205104:

---------205075----------
With the recent change of the sctp checksum to support offload,
no delayed checksum was added to the ip6 output code. This
causes cards that do not support SCTP checksum offload to
have SCTP packets that are IPv6 NOT have the sctp checksum
performed. Thus you could not communicate with a peer. This
adds the missing bits to make the checksum happen for these cards.
-------------------------
---------205104----------
The proper fix for the delayed SCTP checksum is to
have the delayed function take an argument as to the offset
to the SCTP header. This allows it to work for V4 and V6.
This of course means changing all callers of the function
to either pass the header len, if they have it, or create
it (ip_hl << 2 or sizeof(ip6_hdr)).
-------------------------
PR: 144529


# c951da56 01-Apr-2010 Qing Li <qingli@FreeBSD.org>

MFC 204902

One of the advantages of enabling ECMP (a.k.a RADIX_MPATH) is to
allow for connection load balancing across interfaces. Currently
the address alias handling method is colliding with the ECMP code.
For example, when two interfaces are configured on the same prefix,
only one prefix route is installed. So connection load balancing
among the available interfaces is not possible.

The other advantage of ECMP is for failover. The issue with the
current code, is that the interface link-state is not reflected
in the route entry. For example, if there are two interfaces on
the same prefix, the cable on one interface is unplugged, new and
existing connections should switch over to the other interface.
This is not done today and packets go into a black hole.

Also, there is a small bug in the kernel where deleting ECMP routes
in the userland will always return an error even though the command
is successfully executed.


# ca2d42b2 01-Apr-2010 Qing Li <qingli@FreeBSD.org>

MFC 201131

introduce a local variable rte acting as a cache of ro->ro_rt
within ip_output, achieving (in random order of importance):
- a reduction of the number of 'r's in the source code;
- improved legibility;
- a reduction of 64 bytes in the .text


# e952596a 31-Mar-2010 Kip Macy <kmacy@FreeBSD.org>

MFC 205066, 205069, 205093, 205097, 205488:

r205066:

Log:
- restructure flowtable to support ipv6
- add a name argument to flowtable_alloc for printing with ddb commands
- extend ddb commands to print destination address or 4-tuples
- don't parse ports in ulp header if FL_HASH_ALL is not passed
- add kern_flowtable_insert to enable more generic use of flowtable
(e.g. system calls for adding entries)
- don't hash loopback addresses
- cleanup whitespace
- keep statistics per-cpu for per-cpu flowtables to avoid cache line contention
- add sysctls to accumulate stats and report aggregate

r205069:
Log:
fix stats reporting sysctl

r205093:
Log:
re-update copyright to 2010
pointed out by danfe@

r205097:

Log:
flowtable_get_hashkey is only used by a DDB function - move under #ifdef DDB

pointed out by jkim@

r205488:

Log:
- boot-time size the ipv4 flowtable and the maximum number of flows
- increase flow cleaning frequency and decrease flow caching time
when near the flow limit
- stop allocating new flows when within 3% of maxflows don't start
allocating again until below 12.5%


# 1966e5b5 12-Mar-2010 Randall Stewart <rrs@FreeBSD.org>

The proper fix for the delayed SCTP checksum is to
have the delayed function take an argument as to the offset
to the SCTP header. This allows it to work for V4 and V6.
This of course means changing all callers of the function
to either pass the header len, if they have it, or create
it (ip_hl << 2 or sizeof(ip6_hdr)).
PR: 144529
MFC after: 2 weeks


# d4121a02 11-Mar-2010 Kip Macy <kmacy@FreeBSD.org>

- restructure flowtable to support ipv6
- add a name argument to flowtable_alloc for printing with ddb commands
- extend ddb commands to print destination address or 4-tuples
- don't parse ports in ulp header if FL_HASH_ALL is not passed
- add kern_flowtable_insert to enable more generic use of flowtable
(e.g. system calls for adding entries)
- don't hash loopback addresses
- cleanup whitespace
- keep statistics per-cpu for per-cpu flowtables to avoid cache line contention
- add sysctls to accumulate stats and report aggregate

MFC after: 7 days


# c7ea0aa6 08-Mar-2010 Qing Li <qingli@FreeBSD.org>

One of the advantages of enabling ECMP (a.k.a RADIX_MPATH) is to
allow for connection load balancing across interfaces. Currently
the address alias handling method is colliding with the ECMP code.
For example, when two interfaces are configured on the same prefix,
only one prefix route is installed. So connection load balancing
among the available interfaces is not possible.

The other advantage of ECMP is for failover. The issue with the
current code, is that the interface link-state is not reflected
in the route entry. For example, if there are two interfaces on
the same prefix, the cable on one interface is unplugged, new and
existing connections should switch over to the other interface.
This is not done today and packets go into a black hole.

Also, there is a small bug in the kernel where deleting ECMP routes
in the userland will always return an error even though the command
is successfully executed.

MFC after: 5 days


# 2ae7ec29 07-Feb-2010 Julian Elischer <julian@FreeBSD.org>

MFC of 197952 and 198075

Virtualize the pfil hooks so that different jails may chose different
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.
and
Unbreak the VIMAGE build with IPSEC, broken with r197952 by
virtualizing the pfil hooks.
For consistency add the V_ to virtualize the pfil hooks in here as well.


# fc74d005 28-Dec-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Make the compiler happy after r201125:
- + remove two unnecessary initializations in ip_output;
+ + remove one unnecessary initializations in ip_output;


# ec396e61 28-Dec-2009 Luigi Rizzo <luigi@FreeBSD.org>

introduce a local variable rte acting as a cache of ro->ro_rt
within ip_output, achieving (in random order of importance):
- a reduction of the number of 'r's in the source code;
- improved legibility;
- a reduction of 64 bytes in the .text


# ca8b83b0 28-Dec-2009 Luigi Rizzo <luigi@FreeBSD.org>

+ remove an unused #define print_ip;
+ remove two unnecessary initializations in ip_output;
+ localize 'len';
+ introduce a temporary variable n to count the number of fragments,
the compiler seems unable to identify a common subexpression
(written 3 times, used twice);
+ document some assumptions on ip_len and ip_hl


# 4f7418a0 09-Nov-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove ifdefed out part of code, which seems to have originated a decade ago
in OpenBSD. As it is now, there is no way for this to be useful, since IPsec
is free to forward packets via whatever interface it wants, so checking
capabilities of the interface passed from ip_output (fetched from the routing
table) serves no purpose.

Discussed with: sam@


# 0b4b0b0f 10-Oct-2009 Julian Elischer <julian@FreeBSD.org>

Virtualize the pfil hooks so that different jails may chose different
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.

Sitting aroung in tree waiting to commit for: 2 months
MFC after: 2 months


# d84f95cd 30-Aug-2009 Qing Li <qingli@FreeBSD.org>

MFC r196608

Do not try to free the rt_lle entry of the cached route in
ip_output() if the cached route was not initialized from the
flow-table. The rt_lle entry is invalid unless it has been
initialized through the flow-table.

Reviewed by: kmacy, rwatson
Approved by: re


# 0437a933 27-Aug-2009 Qing Li <qingli@FreeBSD.org>

Do not try to free the rt_lle entry of the cached route in
ip_output() if the cached route was not initialized from the
flow-table. The rt_lle entry is invalid unless it has been
initialized through the flow-table.

Reviewed by: kmacy, rwatson
MFC after: immediately


# 670151d0 18-Aug-2009 Kip Macy <kmacy@FreeBSD.org>

MFC 196368
- change the interface to flowtable_lookup so that we don't rely on
the mbuf for obtaining the fib index
- check that a cached flow corresponds to the same fib index as the
packet for which we are doing the lookup
- at interface detach time flush any flows referencing stale rtentrys
associated with the interface that is going away (fixes reported
panics)
- reduce the time between cleans in case the cleaner is running at
the time the eventhandler is called and the wakeup is missed less
time will elapse before the eventhandler returns
- separate per-vnet initialization from global initialization
(pointed out by jeli@)

Reviewed by: sam@
Approved by: re@


# 3ee42584 18-Aug-2009 Kip Macy <kmacy@FreeBSD.org>

- change the interface to flowtable_lookup so that we don't rely on
the mbuf for obtaining the fib index
- check that a cached flow corresponds to the same fib index as the
packet for which we are doing the lookup
- at interface detach time flush any flows referencing stale rtentrys
associated with the interface that is going away (fixes reported
panics)
- reduce the time between cleans in case the cleaner is running at
the time the eventhandler is called and the wakeup is missed less
time will elapse before the eventhandler returns
- separate per-vnet initialization from global initialization
(pointed out by jeli@)

Reviewed by: sam@
Approved by: re@


# 6e12c675 14-Aug-2009 Qing Li <qingli@FreeBSD.org>

MFC 196234

In function ip_output(), the cached route is flushed when there is a
mismatch between the cached entry and the intended destination. The
cached rtentry{} is flushed but the associated llentry{} is not. This
causes the wrong destination MAC address being used in the output
packets. The fix is to flush the llentry{} when rtentry{} is cleared.

Reviewed by: kmacy, rwatson
Approved by: re


# 3ef5e21d 14-Aug-2009 Qing Li <qingli@FreeBSD.org>

In function ip_output(), the cached route is flushed when there is a
mismatch between the cached entry and the intended destination. The
cached rtentry{} is flushed but the associated llentry{} is not. This
causes the wrong destination MAC address being used in the output
packets. The fix is to flush the llentry{} when rtentry{} is cleared.

Reviewed by: kmacy, rwatson
Approved by: re


# 530c0060 01-Aug-2009 Robert Watson <rwatson@FreeBSD.org>

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# eddfbb76 14-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 8c0fec80 23-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references. The following routines now return references:

ifaddr_byindex
ifa_ifwithaddr
ifa_ifwithbroadaddr
ifa_ifwithdstaddr
ifa_ifwithnet
ifaof_ifpforaddr
ifa_ifwithroute
ifa_ifwithroute_fib
rt_getifa
rt_getifa_fib
IFP_TO_IA
ip_rtaddr
in6_ifawithifp
in6ifa_ifpforlinklocal
in6ifa_ifpwithaddr
in6_ifadd
carp_iamatch6
ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc). In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit. Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by: bz
Obtained from: Apple, Inc. (portions)
MFC after: 6 weeks (portions)


# fa057b15 22-Jun-2009 Marko Zec <zec@FreeBSD.org>

V_irtualize flowtable state.

This change should make options VIMAGE kernel builds usable again,
to some extent at least.

Note that the size of struct vnet_inet has changed, though in
accordance with one-bump-per-day policy we didn't update the
__FreeBSD_version number, given that it has already been touched
by r194640 a few hours ago.
Reviewed by: bz
Approved by: julian (mentor)


# 53be8fca 12-Jun-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Move the kernel option FLOWTABLE chacking from the header file to the
actual implementation.
Remove the accessor functions for the compiled out case, just returning
"unavail" values. Remove the kernel conditional from the header file as
it is no longer needed, only leaving the externs.
Hide the improperly virtualized SYSCTL/TUNABLE for the flowtable size
under the kernel option as well.

Reviewed by: rwatson


# 42a36133 05-Jun-2009 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Only four out of nine arguments for ip_ipsec_output() are actually used.
Kill unused arguments except for 'ifp' as it might be used in the future
for detecting IPsec-capable interfaces.


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# f44270e7 01-Jun-2009 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Rename IP_NONLOCALOK IP socket option to IP_BINDANY, to be more consistent
with OpenBSD (and BSD/OS originally). We can't easly do it SOL_SOCKET option
as there is no more space for more SOL_SOCKET options, but this option also
fits better as an IP socket option, it seems.
- Implement this functionality also for IPv6 and RAW IP sockets.
- Always compile it in (don't use additional kernel options).
- Remove sysctl to turn this functionality on and off.
- Introduce new privilege - PRIV_NETINET_BINDANY, which allows to use this
functionality (currently only unjail root can use it).

Discussed with: julian, adrian, jhb, rwatson, kmacy


# 62e1ba83 21-May-2009 Robert Watson <rwatson@FreeBSD.org>

Consolidate and clean up the first section of ip_output.c in light of the
last year or two's work on routing:

- Combine iproute initialization and flowtable lookup blocks, eliminating
unnecessary tests for known-zero'd iproute fields.

- Add a comment indicating (a) why the route entry returned by the
flowtable is considered stable and (b) that the flowtable lookup must
occur after the setup of the mbuf flow ID.

- Assert the inpcb lock before any use of inpcb fields.

Reviewed by: kmacy


# 1a499816 28-Apr-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Don't require packet to match a route (any route; this information wasn't
used anyway, so a typical workaround was to add a dummy route) if it's going
to be sent through IPSec tunnel.

Reviewed by: bz


# 65111ec7 18-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

- Allocate a small flowtable in ip_input.c (changeable by tuneable)
- Use for accelerating ip_output


# 279aa3d4 16-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

Change if_output to take a struct route as its fourth argument in order
to allow passing a cached struct llentry * down to L2

Reviewed by: rwatson


# 86425c62 11-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Update stats in struct ipstat using four new macros, IPSTAT_ADD(),
IPSTAT_INC(), IPSTAT_SUB(), and IPSTAT_DEC(), rather than directly
manipulating the fields across the kernel. This will make it easier
to change the implementation of these statistics, such as using
per-CPU versions of the data structures.

MFC after: 3 days


# 80cb9f21 10-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

Import "flowid" support for serializing flows across transmit queues

Reviewed by: rwatson and jeli


# 8b889dbb 03-Mar-2009 Bruce M Simpson <bms@FreeBSD.org>

In ip_output(), do not acquire the IN_MULTI_LOCK(),
and do not attempt to perform a group lookup.
This is a socket layer lock, and the bottom half of IP
really has no business taking it.

Use the value of the in_mcast_loop sysctl to determine
if we should loop back by default, in the absence of
any multicast socket options. Because the check on
group membership is now deferred to the input path,
an m_copym() is now required.

This should increase multicast send performance where the
source has not requested loopback, although this has not been
benchmarked or measured.

It is also a necessary change for IN_MULTI_LOCK to become
non-recursive, which is required in order to implement IGMPv3
in a thread-safe way.


# 33553d6e 27-Feb-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.


# 97aa4a51 08-Feb-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Try to remove/assimilate as much of formerly IPv4/6 specific
(duplicate) code in sys/netipsec/ipsec.c and fold it into
common, INET/6 independent functions.

The file local functions ipsec4_setspidx_inpcb() and
ipsec6_setspidx_inpcb() were 1:1 identical after the change
in r186528. Rename to ipsec_setspidx_inpcb() and remove the
duplicate.

Public functions ipsec[46]_get_policy() were 1:1 identical.
Remove one copy and merge in the factored out code from
ipsec_get_policy() into the other. The public function left
is now called ipsec_get_policy() and callers were adapted.

Public functions ipsec[46]_set_policy() were 1:1 identical.
Rename file local ipsec_set_policy() function to
ipsec_set_policy_internal().
Remove one copy of the public functions, rename the other
to ipsec_set_policy() and adapt callers.

Public functions ipsec[46]_hdrsiz() were logically identical
(ignoring one questionable assert in the v6 version).
Rename the file local ipsec_hdrsiz() to ipsec_hdrsiz_internal(),
the public function to ipsec_hdrsiz(), remove the duplicate
copy and adapt the callers.
The v6 version had been unused anyway. Cleanup comments.

Public functions ipsec[46]_in_reject() were logically identical
apart from statistics. Move the common code into a file local
ipsec46_in_reject() leaving vimage+statistics in small AF specific
wrapper functions. Note: unfortunately we already have a public
ipsec_in_reject().

Reviewed by: sam
Discussed with: rwatson (renaming to *_internal)
MFC after: 26 days
X-MFC: keep wrapper functions for public symbols?


# 2f4afd21 03-Feb-2009 Randall Stewart <rrs@FreeBSD.org>

Adds support for SCTP checksum offload. This means
we, like TCP and UDP, move the checksum calculation
into the IP routines when there is no hardware support
we call into the normal SCTP checksum routine.

The next round of SCTP updates will use
this functionality. Of course the IGB driver needs
a few updates to support the new intel controller set
that actually does SCTP csum offload too.

Reviewed by: gnn, rwatson, kmacy


# cef27294 09-Jan-2009 Adrian Chadd <adrian@FreeBSD.org>

Fix indentation; add FALLTHROUGH.

Thanks Max!


# be9347e3 09-Jan-2009 Adrian Chadd <adrian@FreeBSD.org>

Implement a new IP option (not compiled/enabled by default) to allow
applications to specify a non-local IP address when bind()'ing a socket
to a local endpoint.

This allows applications to spoof the client IP address of connections
if (obviously!) they somehow are able to receive the traffic normally
destined to said clients.

This patch doesn't include any changes to ipfw or the bridging code to
redirect the client traffic through the PCB checks so TCP gets a shot
at it. The normal behaviour is that packets with a non-local destination
IP address are not handled locally. This can be dealth with some IPFW hackery;
modifications to IPFW to make this less hacky will occur in subsequent
commmits.

Thanks to Julian Elischer and others at Ironport. This work was approved
and donated before Cisco acquired them.

Obtained from: Julian Elischer and others
MFC after: 2 weeks


# a603c811 03-Jan-2009 Robert Watson <rwatson@FreeBSD.org>

Allow the IP_MINTTL socket option to be set to 0 so that it can be
disabled entirely, which is its default state before set to a
non-zero value.

PR: 128790
Submitted by: Nick Hilliard <nick at foobar dot org>
MFC after: 3 weeks


# 6e6b3f7c 14-Dec-2008 Qing Li <qingli@FreeBSD.org>

This main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,

The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
me maintaining that branch before the svn conversion


# 385195c0 10-Dec-2008 Marko Zec <zec@FreeBSD.org>

Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out. Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs. This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps. PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw. Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 4b79449e 02-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 97021c24 26-Nov-2008 Marko Zec <zec@FreeBSD.org>

Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# bc97ba51 19-Nov-2008 Julian Elischer <julian@FreeBSD.org>

Fix a scope problem in the multiple routing table code that stopped the
SO_SETFIB socket option from working correctly.

Obtained from: Ironport
MFC after: 3 days


# 44e33a07 19-Nov-2008 Marko Zec <zec@FreeBSD.org>

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 8b615593 02-Oct-2008 Marko Zec <zec@FreeBSD.org>

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# e4762f75 29-Aug-2008 George V. Neville-Neil <gnn@FreeBSD.org>

Fix a bug whereby multicast packets that are looped back locally
wind up with the incorrect checksum on the wire when transmitted via
devices that do checksum offloading.

PR: kern/119635
Reviewed by: rwatson
MFC after: 5 days


# 5060346d 21-Aug-2008 Robert Watson <rwatson@FreeBSD.org>

Remove comments and #ifdef notyet'd code relating to directly dispatching
the IP multicast input code from the output path; we don't allow
reentrance of the input path from the IP output path, it must use the
netisr due to potential lock recursion.

MFC after: 3 days


# 603724d3 17-Aug-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# 8b07e49a 09-May-2008 Julian Elischer <julian@FreeBSD.org>

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# baa45840 19-Apr-2008 Robert Watson <rwatson@FreeBSD.org>

In ip_output(), allow a read lock as well as a write lock when asserting
a lock on the passed inpcb.

MFC after: 3 months


# 8501a69c 17-Apr-2008 Robert Watson <rwatson@FreeBSD.org>

Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after: 3 months
Tested by: kris (superset of committered patch)


# e440aed9 12-Apr-2008 Qing Li <qingli@FreeBSD.org>

This patch provides the back end support for equal-cost multi-path
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
is disallowed. For example,

route add -net 192.103.54.0/24 10.9.44.1
route add -net 192.103.54.0/24 10.9.44.2

The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"

Multiple default routes can also be inserted. Here is the netstat
output:

default 10.2.5.1 UGS 0 3074 bge0 =>
default 10.2.5.2 UGS 0 0 bge0

When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,

route delete default

would fail and trigger the following error message:

"route: writing to routing socket: No such process"
"delete net default: not in table"

On the other hand,

route delete default 10.2.5.2

would be successful: "delete net default: gateway 10.2.5.2"

One does not have to specify a gateway if there is only a single
route for a particular destination.

I need to perform more testings on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options RADIX_MPATH" in the kernel configuration
to enable this feature.

Reviewed by: robert, sam, gnn, julian, kmacy


# ea26d587 25-Mar-2008 Ruslan Ermilov <ru@FreeBSD.org>

Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT.
Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true
since the advent of MBUMA.

Reviewed by: arch

There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.


# c26fe973 02-Feb-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Rather than passing around a cached 'priv', pass in an ucred to
ipsec*_set_policy and do the privilege check only if needed.

Try to assimilate both ip*_ctloutput code blocks calling ipsec*_set_policy.

Reviewed by: rwatson


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 4b421e2d 07-Oct-2007 Mike Silbersack <silby@FreeBSD.org>

Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by: re (kensmith)


# b2630c29 02-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC
option is now deprecated, as well as the KAME IPsec code.
What was FAST_IPSEC is now IPSEC.

Approved by: re
Sponsored by: Secure Computing


# 2cb64cb2 01-Jul-2007 George V. Neville-Neil <gnn@FreeBSD.org>

Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by: bz
Approved by: re
Supported by: Secure Computing


# 71498f30 12-Jun-2007 Bruce M Simpson <bms@FreeBSD.org>

Import rewrite of IPv4 socket multicast layer to support source-specific
and protocol-independent host mode multicast. The code is written to
accomodate IPv6, IGMPv3 and MLDv2 with only a little additional work.

This change only pertains to FreeBSD's use as a multicast end-station and
does not concern multicast routing; for an IGMPv3/MLDv2 router
implementation, consider the XORP project.

The work is based on Wilbert de Graaf's IGMPv3 code drop for FreeBSD 4.6,
which is available at: http://www.kloosterhof.com/wilbert/igmpv3.html

Summary
* IPv4 multicast socket processing is now moved out of ip_output.c
into a new module, in_mcast.c.
* The in_mcast.c module implements the IPv4 legacy any-source API in
terms of the protocol-independent source-specific API.
* Source filters are lazy allocated as the common case does not use them.
They are part of per inpcb state and are covered by the inpcb lock.
* struct ip_mreqn is now supported to allow applications to specify
multicast joins by interface index in the legacy IPv4 any-source API.
* In UDP, an incoming multicast datagram only requires that the source
port matches the 4-tuple if the socket was already bound by source port.
An unbound socket SHOULD be able to receive multicasts sent from an
ephemeral source port.
* The UDP socket multicast filter mode defaults to exclusive, that is,
sources present in the per-socket list will be blocked from delivery.
* The RFC 3678 userland functions have been added to libc: setsourcefilter,
getsourcefilter, setipv4sourcefilter, getipv4sourcefilter.
* Definitions for IGMPv3 are merged but not yet used.
* struct sockaddr_storage is now referenced from <netinet/in.h>. It
is therefore defined there if not already declared in the same way
as for the C99 types.
* The RFC 1724 hack (specify 0.0.0.0/8 addresses to IP_MULTICAST_IF
which are then interpreted as interface indexes) is now deprecated.
* A patch for the Rhyolite.com routed in the FreeBSD base system
is available in the -net archives. This only affects individuals
running RIPv1 or RIPv2 via point-to-point and/or unnumbered interfaces.
* Make IPv6 detach path similar to IPv4's in code flow; functionally same.
* Bump __FreeBSD_version to 700048; see UPDATING.

This work was financially supported by another FreeBSD committer.

Obtained from: p4://bms_netdev
Submitted by: Wilbert de Graaf (original work)
Reviewed by: rwatson (locking), silence from fenner,
net@ (but with encouragement)


# f2565d68 10-May-2007 Robert Watson <rwatson@FreeBSD.org>

Move universally to ANSI C function declarations, with relatively
consistent style(9)-ish layout.


# 73ec8173 23-Mar-2007 Bruce M Simpson <bms@FreeBSD.org>

Purge two redundant case labels.


# a3fd02d8 01-Mar-2007 Bruce M Simpson <bms@FreeBSD.org>

Fix undirected broadcast sends for the case where SO_DONTROUTE has also
been set at the socket layer, in our somewhat convoluted IPv4 source
selection logic in ip_output().

IP_ONESBCAST is actually a special case of SO_DONTROUTE, as 255.255.255.255
must always be delivered on a local link with a TTL of 1.

If IP_ONESBCAST has been set at the socket layer, also perform destination
interface lookup for point-to-point interfaces based on the destination
address of the link; previously it was not possible to use the option with
such interfaces; also, the destination/broadcast address fields map to the
same field within struct ifnet, which doesn't help matters.

One more valid fix going forward for these issues is to treat 255.255.255.255
as a destination in its own right in the forwarding trie. Other
implementations do this. It fits with the use of multiple paths, though
it then becomes necessary to specify interface preference.
This hack will eventually go away when that comes to pass.

Reviewed by: andre
MFC after: 1 week


# 3dbee59b 10-Dec-2006 Bruce M Simpson <bms@FreeBSD.org>

Back out revision 1.264.

Fixing the IP accounting issue, if we plan to do so, needs to be better
thought out; the 'fix' introduces a hash lookup and a possible kernel panic.

Reported by: Mark Tinguely


# acd3428b 06-Nov-2006 Robert Watson <rwatson@FreeBSD.org>

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 6a7c943c 29-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

Remove stone-aged and irrelevant "#ifndef notdef".


# 07ea6709 25-Sep-2006 Bruce M Simpson <bms@FreeBSD.org>

Account for output IP datagrams on the ifaddr where they originated from,
*not* the first ifaddr on the ifp. This is similar to what NetBSD does.

PR: kern/72936
Submitted by: alfred
Reviewed by: andre


# 384a05bf 11-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

Fix a NULL pointer dereference of ro->ro_rt->rt_flags by checking for the
validity of ro->ro_rt first. This prevents crashing on any non-normally
routed IP packet.

Coverity CID: 162 (incorrectly, it was re-introduced by previous commit)


# 3ae2ad08 10-Sep-2006 John-Mark Gurney <jmg@FreeBSD.org>

make use of the host route's mtu for processing. This means we can now
support a network w/ split mtu's by assigning each host route the correct
mtu. an aspiring programmer could write a daemon to probe hosts and find
out if they support a larger mtu.


# 233dcce1 06-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

First step of TSO (TCP segmentation offload) support in our network stack.

o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
o add CSUM_TSO flag to mbuf pkthdr csum_flags field
o add tso_segsz field to mbuf pkthdr
o enhance ip_output() packet length check to allow for large TSO packets
o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
o adjust all callers of tcp_maxmtu[46]() accordingly

Discussed on: -current, -net
Sponsored by: TCP/IP Optimization Fundraise 2005


# 773725a2 06-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

Fix the socket option IP_ONESBCAST by giving it its own case in ip_output()
and skip over the normal IP processing.

Add a supporting function ifa_ifwithbroadaddr() to verify and validate the
supplied subnet broadcast address.

PR: kern/99558
Tested by: Andrey V. Elsukov <bu7cher-at-yandex.ru>
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


# b7522c27 16-Aug-2006 Julian Elischer <julian@FreeBSD.org>

Remove the IPFIREWALL_FORWARD_EXTENDED option and make it on by default as it always was
in older versions of FreeBSD. This option is pointless as it is needed in just
about every interesting usage of forward that I have ever seen. It doesn't make
the system any safer and just wastes huge amounts of develper time
when the system doesn't behave as expected when code is moved from
4.x to 6.x It doesn't make
the system any safer and just wastes huge amounts of develper time
when the system doesn't behave as expected when code is moved from
4.x to 6.x or 7.x
Reviewed by: glebius
MFC after: 1 week


# 4d09f5a0 29-Jun-2006 Gleb Smirnoff <glebius@FreeBSD.org>

Fix URL to Bellovin's paper.

Submitted by: Anton Yuzhaninov <citrin rambler-co.ru>


# 635354c4 21-May-2006 Maxim Konovalov <maxim@FreeBSD.org>

o Add missed error check: in ip_ctloutput() sooptcopyin() returns a
result but we never examine it.

Reviewed by: rwatson
MFC after: 2 weeks


# 3548bfc9 14-May-2006 Bruce M Simpson <bms@FreeBSD.org>

Fix a long-standing limitation in IPv4 multicast group membership.

By making the imo_membership array a dynamically allocated vector,
this minimizes disruption to existing IPv4 multicast code. This
change breaks the ABI for the kernel module ip_mroute.ko, and may
cause a small amount of churn for folks working on the IGMPv3 merge.

Previously, sockets were subject to a compile-time limitation on
the number of IPv4 group memberships, which was hard-coded to 20.
The imo_membership relationship, however, is 1:1 with regards to
a tuple of multicast group address and interface address. Users who
ran routing protocols such as OSPF ran into this limitation on machines
with a large system interface tree.


# 604afec4 01-Feb-2006 Christian S.J. Peron <csjp@FreeBSD.org>

Somewhat re-factor the read/write locking mechanism associated with the packet
filtering mechanisms to use the new rwlock(9) locking API:

- Drop the variables stored in the phil_head structure which were specific to
conditions and the home rolled read/write locking mechanism.
- Drop some includes which were used for condition variables
- Drop the inline functions, and convert them to macros. Also, move these
macros into pfil.h
- Move pfil list locking macros intp phil.h as well
- Rename ph_busy_count to ph_nhooks. This variable will represent the number
of IN/OUT hooks registered with the pfil head structure
- Define PFIL_HOOKED macro which evaluates to true if there are any
hooks to be ran by pfil_run_hooks
- In the IP/IP6 stacks, change the ph_busy_count comparison to use the new
PFIL_HOOKED macro.
- Drop optimization in pfil_run_hooks which checks to see if there are any
hooks to be ran, and returns if not. This check is already performed by the
IP stacks when they call:

if (!PFIL_HOOKED(ph))
goto skip_hooks;

- Drop in assertion which makes sure that the number of hooks never drops
below 0 for good measure. This in theory should never happen, and if it
does than there are problems somewhere
- Drop special logic around PFIL_WAITOK because rw_wlock(9) does not sleep
- Drop variables which support home rolled read/write locking mechanism from
the IPFW firewall chain structure.
- Swap out the read/write firewall chain lock internal to use the rwlock(9)
API instead of our home rolled version
- Convert the inlined functions to macros

Reviewed by: mlaier, andre, glebius
Thanks to: jhb for the new locking API


# 1dfcf0d2 01-Feb-2006 Andre Oppermann <andre@FreeBSD.org>

Move the IPSEC related code blocks to their own file to unclutter
and signifincantly improve the readability of ip_input() and
ip_output() again.

The resulting IPSEC hooks in ip_input() and ip_output() may be
used later on for making IPSEC loadable.

This move is mostly mechanical and should preserve current IPSEC
behaviour as-is. Nothing shall prevent improvements in the way
IPSEC interacts with the IPv4 stack.

Discussed with: bz, gnn, rwatson; (earlier version)


# 8f8d29f6 18-Jan-2006 Andre Oppermann <andre@FreeBSD.org>

In in_delayed_cksum() we can't perform a m_pullup() as it may
change the mbuf pointer and we don't have any way of passing
it back to the callers. Instead just fail silently without
updating the checksum but leaving the mbuf+chain intact.

A search in our GNATS database did not turn up any match for
the existing warning message when this case is encountered.

Found by: Coverity Prevent(tm)
Coverity ID: CID779
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


# 39550088 18-Jan-2006 Andre Oppermann <andre@FreeBSD.org>

Prevent dereferencing a NULL route pointer when trying to update the
route MTU.

This bug is very difficult to reach and not remotely exploitable.

Found by: Coverity Prevent(tm)
Coverity ID: CID162
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


# bbce982b 06-Dec-2005 Gleb Smirnoff <glebius@FreeBSD.org>

When we drop packet due to no space in output interface output queue, also
increase the ifp->if_snd.ifq_drops.

PR: 72440
Submitted by: ikob


# ef39adf0 18-Nov-2005 Andre Oppermann <andre@FreeBSD.org>

Consolidate all IP Options handling functions into ip_options.[ch] and
include ip_options.h into all files making use of IP Options functions.

From ip_input.c rev 1.306:
ip_dooptions(struct mbuf *m, int pass)
save_rte(m, option, dst)
ip_srcroute(m0)
ip_stripoptions(m, mopt)

From ip_output.c rev 1.249:
ip_insertoptions(m, opt, phlen)
ip_optcopy(ip, jp)
ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m)

No functional changes in this commit.

Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005


# 147f74d1 18-Nov-2005 Andre Oppermann <andre@FreeBSD.org>

Purge layer specific mbuf flags on layer crossings to avoid confusing
upper or lower layers.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 34333b16 02-Nov-2005 Andre Oppermann <andre@FreeBSD.org>

Retire MT_HEADER mbuf type and change its users to use MT_DATA.

Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it. It only adds a layer of confusion. The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit. For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by: TCP/IP Optimization Fundraise 2005


# b2828ad2 26-Sep-2005 Andre Oppermann <andre@FreeBSD.org>

Implement IP_DONTFRAG IP socket option enabling the Don't Fragment
flag on IP packets. Currently this option is only repected on udp
and raw ip sockets. On tcp sockets the DF flag is controlled by the
path MTU discovery option.

Sending a packet larger than the MTU size of the egress interface
returns an EMSGSIZE error.

Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005


# e0aec682 30-Aug-2005 Andre Oppermann <andre@FreeBSD.org>

Use the correct mbuf type for MGET().


# 936cd18d 22-Aug-2005 Andre Oppermann <andre@FreeBSD.org>

Add socketoption IP_MINTTL. May be used to set the minimum acceptable
TTL a packet must have when received on a socket. All packets with a
lower TTL are silently dropped. Works on already connected/connecting
and listening sockets for RAW/UDP/TCP.

This option is only really useful when set to 255 preventing packets
from outside the directly connected networks reaching local listeners
on sockets.

Allows userland implementation of 'The Generalized TTL Security Mechanism
(GTSM)' according to RFC3682. Examples of such use include the Cisco IOS
BGP implementation command "neighbor ttl-security".

MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005


# a2dc1f50 09-Aug-2005 Robert Watson <rwatson@FreeBSD.org>

Add helper function ip_findmoptions(), which accepts an inpcb, and attempts
to atomically return either an existing set of IP multicast options for the
PCB, or a newlly allocated set with default values. The inpcb is returned
locked. This function may sleep.

Call ip_moptions() to acquire a reference to a PCB's socket options, and
perform the update of the options while holding the PCB lock. Release the
lock before returning.

Remove garbage collection of multicast options when values return to the
default, as this complicates locking substantially. Most applications
allocate a socket either to be multicast, or not, and don't tend to keep
around sockets that have previously been used for multicast, then used for
unicast.

This closes a number of race conditions involving multiple threads or
processes modifying the IP multicast state of a socket simultaenously.

MFC after: 7 days


# dd5a318b 03-Aug-2005 Robert Watson <rwatson@FreeBSD.org>

Introduce in_multi_mtx, which will protect IPv4-layer multicast address
lists, as well as accessor macros. For now, this is a recursive mutex
due code sequences where IPv4 multicast calls into IGMP calls into
ip_output(), which then tests for a multicast forwarding case.

For support macros in in_var.h to check multicast address lists, assert
that in_multi_mtx is held.

Acquire in_multi_mtx around iteration over the IPv4 multicast address
lists, such as in ip_input() and ip_output().

Acquire in_multi_mtx when manipulating the IPv4 layer multicast addresses,
as well as over the manipulation of ifnet multicast address lists in order
to keep the two layers in sync.

Lock down accesses to IPv4 multicast addresses in IGMP, or assert the
lock when performing IGMP join/leave events.

Eliminate spl's associated with IPv4 multicast addresses, portions of
IGMP that weren't previously expunged by IGMP locking.

Add in_multi_mtx, igmp_mtx, and if_addr_mtx lock order to hard-coded
lock order in WITNESS, in that order.

Problem reported by: Ed Maste <emaste at phaedrus dot sandvine dot ca>
MFC after: 10 days


# 3c308b09 05-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Eliminate MAC entry point mac_create_mbuf_from_mbuf(), which is
redundant with respect to existing mbuf copy label routines. Expose
a new mac_copy_mbuf() routine at the top end of the Framework and
use that; use the existing mpo_copy_mbuf_label() routine on the
bottom end.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA, SPAWAR
Approved by: re (scottl)


# fc74a9f9 10-Jun-2005 Brooks Davis <brooks@FreeBSD.org>

Stop embedding struct ifnet at the top of driver softcs. Instead the
struct ifnet or the layer 2 common structure it was embedded in have
been replaced with a struct ifnet pointer to be filled by a call to the
new function, if_alloc(). The layer 2 common structure is also allocated
via if_alloc() based on the interface type. It is hung off the new
struct ifnet member, if_l2com.

This change removes the size of these structures from the kernel ABI and
will allow us to better manage them as interfaces come and go.

Other changes of note:
- Struct arpcom is no longer referenced in normal interface code.
Instead the Ethernet address is accessed via the IFP2ENADDR() macro.
To enforce this ac_enaddr has been renamed to _ac_enaddr.
- The second argument to ether_ifattach is now always the mac address
from driver private storage rather than sometimes being ac_enaddr.

Reviewed by: sobomax, sam


# 099dd043 22-Feb-2005 Andre Oppermann <andre@FreeBSD.org>

Bring back the full packet destination manipulation for 'ipfw fwd'
with the kernel compile time option:

options IPFIREWALL_FORWARD_EXTENDED

This option has to be specified in addition to IPFIRWALL_FORWARD.

With this option even packets targeted for an IP address local
to the host can be redirected. All restrictions to ensure proper
behaviour for locally generated packets are turned off. Firewall
rules have to be carefully crafted to make sure that things like
PMTU discovery do not break.

Document the two kernel options.

PR: kern/71910
PR: kern/73129
MFC after: 1 week


# 7258e968 23-Jan-2005 Alan Cox <alc@FreeBSD.org>

Correctly move the packet header in ip_insertoptions().

Reported by: Anupam Chanda
Reviewed by: sam@
MFC after: 2 weeks


# c398230b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 74d4630b 25-Dec-2004 Robert Watson <rwatson@FreeBSD.org>

Remove an errant blank line apparently introduced in
ip_output.c:1.194.


# 89924e58 05-Dec-2004 Robert Watson <rwatson@FreeBSD.org>

Pass the inpcb reference into ip_getmoptions() rather than just the
inp->inp_moptions pointer, so that ip_getmoptions() can perform
necessary locking when doing non-atomic reads.

Lock the inpcb by default to copy any data to local variables, then
unlock before performing sooptcopyout().

MFC after: 2 weeks


# 5c918b56 05-Dec-2004 Robert Watson <rwatson@FreeBSD.org>

Push the inpcb argument into ip_setmoptions() when setting IP multicast
socket options, so that it is available for locking.


# 993d9505 05-Dec-2004 Robert Watson <rwatson@FreeBSD.org>

Start working through inpcb locking for ip_ctloutput() by cleaning up
modifications to the inpcb IP options mbuf:

- Lock the inpcb before passing it into ip_pcbopts() in order to prevent
simulatenous reads and read-modify-writes that could result in races.
- Pass the inpcb reference into ip_pcbopts() instead of the option chain
pointer in the inpcb.
- Assert the inpcb lock in ip_pcbots.
- Convert one or two uses of a pointer as a boolean or an integer
comparison to a comparison with NULL for readability.


# d6a8d588 28-Sep-2004 Max Laier <mlaier@FreeBSD.org>

Add an additional struct inpcb * argument to pfil(9) in order to enable
passing along socket information. This is required to work around a LOR with
the socket code which results in an easy reproducible hard lockup with
debug.mpsafenet=1. This commit does *not* fix the LOR, but enables us to do
so later. The missing piece is to turn the filter locking into a leaf lock
and will follow in a seperate (later) commit.

This will hopefully be MT5'ed in order to fix the problem for RELENG_5 in
forseeable future.

Suggested by: rwatson
A lot of work by: csjp (he'd be even more helpful w/o mentor-reviews ;)
Reviewed by: rwatson, csjp
Tested by: -pf, -ipfw, LINT, csjp and myself
MFC after: 3 days

LOR IDs: 14 - 17 (not fixed yet)


# f4fca2d8 13-Sep-2004 Andre Oppermann <andre@FreeBSD.org>

Make comments more clear for the packet changed cases after pfil hooks.


# cb459254 06-Sep-2004 John-Mark Gurney <jmg@FreeBSD.org>

revert comment from rev1.158 now that rev1.225 backed it out..

MFC after: 3 days


# a9c92b54 27-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

In the case the destination of a packet was changed by the packet filter
to point to a local IP address; and the packet was sourced from this host
we fill in the m_pkthdr.rcvif with a pointer to the loopback interface.

Before the function ifunit("lo0") was used to obtain the ifp. However
this is sub-optimal from a performance point of view and might be dangerous
if the loopback interface has been renamed. Use the global variable 'loif'
instead which always points to the loopback interface.

Submitted by: brooks


# c21fd232 27-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

Always compile PFIL_HOOKS into the kernel and remove the associated kernel
compile option. All FreeBSD packet filters now use the PFIL_HOOKS API and
thus it becomes a standard part of the network stack.

If no hooks are connected the entire packet filter hooks section and related
activities are jumped over. This removes any performance impact if no hooks
are active.

Both OpenBSD and DragonFlyBSD have integrated PFIL_HOOKS permanently as well.


# ca7a789a 22-Aug-2004 Max Laier <mlaier@FreeBSD.org>

Allow early drop for non-ALTQ enabled queues in an ALTQ-enabled kernel.
Previously the early drop was disabled unconditionally for ALTQ-enabled
kernels.

This should give some benefit for the normal gateway + LAN-server case with
a busy LAN leg and an ALTQ managed uplink.

Reviewed and style help from: cperciva, pjd


# 1e5cc10d 17-Aug-2004 Peter Wemm <peter@FreeBSD.org>

Make the kernel compile again if you are not using PFIL_HOOKS


# 9b932e9e 17-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

Convert ipfw to use PFIL_HOOKS. This is change is transparent to userland
and preserves the ipfw ABI. The ipfw core packet inspection and filtering
functions have not been changed, only how ipfw is invoked is different.

However there are many changes how ipfw is and its add-on's are handled:

In general ipfw is now called through the PFIL_HOOKS and most associated
magic, that was in ip_input() or ip_output() previously, is now done in
ipfw_check_[in|out]() in the ipfw PFIL handler.

IPDIVERT is entirely handled within the ipfw PFIL handlers. A packet to
be diverted is checked if it is fragmented, if yes, ip_reass() gets in for
reassembly. If not, or all fragments arrived and the packet is complete,
divert_packet is called directly. For 'tee' no reassembly attempt is made
and a copy of the packet is sent to the divert socket unmodified. The
original packet continues its way through ip_input/output().

ipfw 'forward' is done via m_tag's. The ipfw PFIL handlers tag the packet
with the new destination sockaddr_in. A check if the new destination is a
local IP address is made and the m_flags are set appropriately. ip_input()
and ip_output() have some more work to do here. For ip_input() the m_flags
are checked and a packet for us is directly sent to the 'ours' section for
further processing. Destination changes on the input path are only tagged
and the 'srcrt' flag to ip_forward() is set to disable destination checks
and ICMP replies at this stage. The tag is going to be handled on output.
ip_output() again checks for m_flags and the 'ours' tag. If found, the
packet will be dropped back to the IP netisr where it is going to be picked
up by ip_input() again and the directly sent to the 'ours' section. When
only the destination changes, the route's 'dst' is overwritten with the
new destination from the forward m_tag. Then it jumps back at the route
lookup again and skips the firewall check because it has been marked with
M_SKIP_FIREWALL. ipfw 'forward' has to be compiled into the kernel with
'option IPFIREWALL_FORWARD' to enable it.

DUMMYNET is entirely handled within the ipfw PFIL handlers. A packet for
a dummynet pipe or queue is directly sent to dummynet_io(). Dummynet will
then inject it back into ip_input/ip_output() after it has served its time.
Dummynet packets are tagged and will continue from the next rule when they
hit the ipfw PFIL handlers again after re-injection.

BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as
they did before. Later this will be changed to dedicated ETHER PFIL_HOOKS.

More detailed changes to the code:

conf/files
Add netinet/ip_fw_pfil.c.

conf/options
Add IPFIREWALL_FORWARD option.

modules/ipfw/Makefile
Add ip_fw_pfil.c.

net/bridge.c
Disable PFIL_HOOKS if ipfw for bridging is active. Bridging ipfw
is still directly invoked to handle layer2 headers and packets would
get a double ipfw when run through PFIL_HOOKS as well.

netinet/ip_divert.c
Removed divert_clone() function. It is no longer used.

netinet/ip_dummynet.[ch]
Neither the route 'ro' nor the destination 'dst' need to be stored
while in dummynet transit. Structure members and associated macros
are removed.

netinet/ip_fastfwd.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code.

netinet/ip_fw.h
Removed 'ro' and 'dst' from struct ip_fw_args.

netinet/ip_fw2.c
(Re)moved some global variables and the module handling.

netinet/ip_fw_pfil.c
New file containing the ipfw PFIL handlers and module initialization.

netinet/ip_input.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code. ip_forward() does not longer require
the 'next_hop' struct sockaddr_in argument. Disable early checks
if 'srcrt' is set.

netinet/ip_output.c
Removed all direct ipfw handling code and replace it with the new
'ipfw forward' handling code.

netinet/ip_var.h
Add ip_reass() as general function. (Used from ipfw PFIL handlers
for IPDIVERT.)

netinet/raw_ip.c
Directly check if ipfw and dummynet control pointers are active.

netinet/tcp_input.c
Rework the 'ipfw forward' to local code to work with the new way of
forward tags.

netinet/tcp_sack.c
Remove include 'opt_ipfw.h' which is not needed here.

sys/mbuf.h
Remove m_claim_next() macro which was exclusively for ipfw 'forward'
and is no longer needed.

Approved by: re (scottl)


# 1f44b0a1 14-Aug-2004 David Malone <dwmalone@FreeBSD.org>

Get rid of the RANDOM_IP_ID option and make it a sysctl. NetBSD
have already done this, so I have styled the patch on their work:

1) introduce a ip_newid() static inline function that checks
the sysctl and then decides if it should return a sequential
or random IP ID.

2) named the sysctl net.inet.ip.random_id

3) IPv6 flow IDs and fragment IDs are now always random.
Flow IDs and frag IDs are significantly less common in the
IPv6 world (ie. rarely generated per-packet), so there should
be smaller performance concerns.

The sysctl defaults to 0 (sequential IP IDs).

Reviewed by: andre, silby, mlaier, ume
Based on: NetBSD
MFC after: 2 months


# 0b17fba7 11-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

Consistently use NULL for pointer comparisons.


# a5053398 09-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

Make a comment that "ipfw forward" is not SMP and PREEMPTION safe.


# 81007fd4 03-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

o Delayed checksums are now calculated in divert_packet() for diverted packets
Remove the XXX-escaped code that did it in ip_output()'s IPHACK section.


# a138d217 23-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

In ip_ctloutput(), acquire the inpcb lock around some of the basic
inpcb flag and status updates.


# 02b199f1 13-Jun-2004 Max Laier <mlaier@FreeBSD.org>

Link ALTQ to the build and break with ABI for struct ifnet. Please recompile
your (network) modules as well as any userland that might make sense of
sizeof(struct ifnet).
This does not change the queueing yet. These changes will follow in a
seperate commit. Same with the driver changes, which need case by case
evaluation.

__FreeBSD_version bump will follow.

Tested-by: (i386)LINT


# a49b2137 11-May-2004 Maxim Konovalov <maxim@FreeBSD.org>

o Calculate a number of bytes to copy (cnt) correctly:

+----+-+-+-+-+----+----+- - - - - - - - - - - - -+----+
| | |C| | | | | | |
| IP |N|O|L|P| | IP | | IP |
| #1 |O|D|E|T| | #2 | | #n |
| |P|E|N|R| | | | |
+----+-+-+-+-+----+----+- - - - - - - - - - - - -+----+
^ ^<---- cnt - (IPOPT_MINOFF - 1) ---->|
| |
src | +-- cp[IPOPT_OFF + 1] + sizeof(struct in_addr)
|
dst +-- cp[IPOPT_OFF + 1]

PR: kern/66386
Submitted by: Andrei Iltchenko
MFC after: 3 weeks


# 2f3f1e67 02-May-2004 Darren Reed <darrenr@FreeBSD.org>

Rename m_claim_next_hop() to m_claim_next(), as suggested by Max Laier.


# ab884d99 02-May-2004 Darren Reed <darrenr@FreeBSD.org>

Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg
(the type of tag to claim) and push it out of ip_var.h into mbuf.h alongside
all of the other macros that work ok mbuf's and tag's.


# e6e51f05 13-Apr-2004 Luigi Rizzo <luigi@FreeBSD.org>

In an effort to simplify the routing code, try to deprecate rtalloc()
in favour of rtalloc_ign(), which is what would end up being called
anyways.

There are 25 more instances of rtalloc() in net*/ and
about 10 instances of rtalloc_ign()


# f36cfd49 07-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 390cdc6a 07-Apr-2004 Ruslan Ermilov <ru@FreeBSD.org>

Fixed a bug in previous revision: compute the payload checksum before
we convert ip_len into a network byte order; in_delayed_cksum() still
expects it in host byte order.

The symtom was the ``in_cksum_skip: out of data by %d'' complaints
from the kernel.

To add to the previous commit log. These fixes make tcpdump(1) happy
by not complaining about UDP/TCP checksum being bad for looped back
IP multicast when multicast router is deactivated.

Reported by: Vsevolod Lobko


# 26f16ebe 25-Mar-2004 Ruslan Ermilov <ru@FreeBSD.org>

Untangle IP multicast routing interaction with delayed payload checksums.

Compute the payload checksum for a locally originated IP multicast where
God intended, in ip_mloopback(), rather than doing it in ip_output() and
only when multicast router is active. This is more correct as we do not
fool ip_input() that the packet has the correct payload checksum when in
fact it does not (when multicast router is inactive). This is also more
efficient if we don't join the multicast group we send to, thus allowing
the hardware to checksum the payload.


# 4672d819 02-Mar-2004 Max Laier <mlaier@FreeBSD.org>

Two minor follow-ups on the MT_TAG removal:
ifp is now passed explicitly to ether_demux; no need to look it up again.
Make mtag a global var in ip_input.

Noticed by: rwatson
Approved by: bms(mentor)


# ac9d7e26 25-Feb-2004 Max Laier <mlaier@FreeBSD.org>

Re-remove MT_TAGs. The problems with dummynet have been fixed now.

Tested by: -current, bms(mentor), me
Approved by: bms(mentor), sam


# 36e8826f 17-Feb-2004 Max Laier <mlaier@FreeBSD.org>

Backout MT_TAG removal (i.e. bring back MT_TAGs) for now, as dummynet is
not working properly with the patch in place.

Approved by: bms(mentor)


# 70dbc6cb 16-Feb-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

don't update outgoing ifp, if ipsec tunnel mode encapsulation
was not made.

Obtained from: KAME


# 1094bdca 13-Feb-2004 Max Laier <mlaier@FreeBSD.org>

This set of changes eliminates the use of MT_TAG "pseudo mbufs", replacing
them mostly with packet tags (one case is handled by using an mbuf flag
since the linkage between "caller" and "callee" is direct and there's no
need to incur the overhead of a packet tag).

This is (mostly) work from: sam

Silence from: -arch
Approved by: bms(mentor), sam, rwatson


# 1cfd4b53 10-Feb-2004 Bruce M Simpson <bms@FreeBSD.org>

Initial import of RFC 2385 (TCP-MD5) digest support.

This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by: sentex.net


# f073c60f 03-Feb-2004 Hajimu UMEMOTO <ume@FreeBSD.org>

pass pcb rather than so. it is expected that per socket policy
works again.


# e0f630ea 08-Jan-2004 Andre Oppermann <andre@FreeBSD.org>

Do not set the ip_id to zero when DF is set on packet and
restore the general pre-randomid behaviour.

Setting the ip_id to zero causes several problems with
packet reassembly when a device along the path removes
the DF bit for some reason.

Other BSD and Linux have found and fixed the same issues.

PR: kern/60889
Tested by: Richard Wendland <richard@wendland.org.uk>
Approved by: re (scottl)


# 97d8d152 20-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# 26d02ca7 20-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Remove RTF_PRCLONING from routing table and adjust users of it
accordingly. The define is left intact for ABI compatibility
with userland.

This is a pre-step for the introduction of tcp_hostcache. The
network stack remains fully useable with this change.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


# 02c1c707 14-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Remove the global one-level rtcache variable and associated
complex locking and rework ip_rtaddr() to do its own rtlookup.
Adopt all its callers to this and make ip_output() callable
with NULL rt pointer.

Reviewed by: sam (mentor)


# 9188b4a1 14-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Introduce ip_fastforward and remove ip_flow.

Short description of ip_fastforward:

o adds full direct process-to-completion IPv4 forwarding code
o handles ip fragmentation incl. hw support (ip_flow did not)
o sends icmp needfrag to source if DF is set (ip_flow did not)
o supports ipfw and ipfilter (ip_flow did not)
o supports divert, ipfw fwd and ipfilter nat (ip_flow did not)
o returns anything it can't handle back to normal ip_input

Enable with sysctl -w net.inet.ip.fastforwarding=1

Reviewed by: sam (mentor)


# 2683ceb6 12-Nov-2003 Andre Oppermann <andre@FreeBSD.org>

Do not fragment a packet with hardware assistance if it has the DF
bit set.

Reviewed by: sam (mentor)


# 84843845 08-Nov-2003 Sam Leffler <sam@FreeBSD.org>

assert optional inpcb is passed in locked

Supported by: FreeBSD Foundation


# 0f9ade71 04-Nov-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

- cleanup SP refcnt issue.
- share policy-on-socket for listening socket.
- don't copy policy-on-socket at all. secpolicy no longer contain
spidx, which saves a lot of memory.
- deep-copy pcb policy if it is an ipsec policy. assign ID field to
all SPD entries. make it possible for racoon to grab SPD entry on
pcb.
- fixed the order of searching SA table for packets.
- fixed to get a security association header. a mode is always needed
to compare them.
- fixed that the incorrect time was set to
sadb_comb_{hard|soft}_usetime.
- disallow port spec for tunnel mode policy (as we don't reassemble).
- an user can define a policy-id.
- clear enc/auth key before freeing.
- fixed that the kernel crashed when key_spdacquire() was called
because key_spdacquire() had been implemented imcopletely.
- preparation for 64bit sequence number.
- maintain ordered list of SA, based on SA id.
- cleanup secasvar management; refcnt is key.c responsibility;
alloc/free is keydb.c responsibility.
- cleanup, avoid double-loop.
- use hash for spi-based lookup.
- mark persistent SP "persistent".
XXX in theory refcnt should do the right thing, however, we have
"spdflush" which would touch all SPs. another solution would be to
de-register persistent SPs from sptree.
- u_short -> u_int16_t
- reduce kernel stack usage by auto variable secasindex.
- clarify function name confusion. ipsec_*_policy ->
ipsec_*_pcbpolicy.
- avoid variable name confusion.
(struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct
secpolicy *)
- count number of ipsec encapsulations on ipsec4_output, so that we
can tell ip_output() how to handle the packet further.
- When the value of the ul_proto is ICMP or ICMPV6, the port field in
"src" of the spidx specifies ICMP type, and the port field in "dst"
of the spidx specifies ICMP code.
- avoid from applying IPsec transport mode to the packets when the
kernel forwards the packets.

Tested by: nork
Obtained from: KAME


# 3de758d3 03-Nov-2003 Robert Watson <rwatson@FreeBSD.org>

Note that when ip_output() is called from ip_forward(), it will already
have its options inserted, so the opt argument to ip_output() must be
NULL.


# d1dd20be 03-Oct-2003 Sam Leffler <sam@FreeBSD.org>

Locking for updates to routing table entries. Each rtentry gets a mutex
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.

Other/related changes:

o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts

Notes:

1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do-no know about mutex's. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.

Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)


# 134ea224 23-Sep-2003 Sam Leffler <sam@FreeBSD.org>

o update PFIL_HOOKS support to current API used by netbsd
o revamp IPv4+IPv6+bridge usage to match API changes
o remove pfil_head instances from protosw entries (no longer used)
o add locking
o bump FreeBSD version for 3rd party modules

Heavy lifting by: "Max Laier" <max@love2party.net>
Supported by: FreeBSD Foundation
Obtained from: NetBSD (bits of pfil.h and pfil.c)


# 3390d476 31-Aug-2003 Mike Silbersack <silby@FreeBSD.org>

Implement MBUF_STRESS_TEST mark II.

Changes from the original implementation:

- Fragmentation is handled by the function m_fragment, which can
be called from whereever fragmentation is needed. Note that this
function is wrapped in #ifdef MBUF_STRESS_TEST to discourage non-testing
use.

- m_fragment works slightly differently from the old fragmentation
code in that it allocates a seperate mbuf cluster for each fragment.
This defeats dma_map_load_mbuf/buffer's feature of coalescing adjacent
fragments. While that is a nice feature in practice, it nerfed the
usefulness of mbuf_stress_test.

- Add two modes of random fragmentation. Chains with fragments all of
the same random length and chains with fragments that are each uniquely
random in length may now be requested.


# 8afa2304 20-Aug-2003 Bruce M Simpson <bms@FreeBSD.org>

Add the IP_ONESBCAST option, to enable undirected IP broadcasts to be sent on
specific interfaces. This is required by aodvd, and may in future help us
in getting rid of the requirement for BPF from our import of isc-dhcp.

Suggested by: fenestro
Obtained from: BSD/OS
Reviewed by: mini, sam
Approved by: jake (mentor)


# 1e78ac21 07-Aug-2003 Jeffrey Hsu <hsu@FreeBSD.org>

1. Basic PIM kernel support
Disabled by default. To enable it, the new "options PIM" must be
added to the kernel configuration file (in addition to MROUTING):

options MROUTING # Multicast routing
options PIM # Protocol Independent Multicast

2. Add support for advanced multicast API setup/configuration and
extensibility.

3. Add support for kernel-level PIM Register encapsulation.
Disabled by default. Can be enabled by the advanced multicast API.

4. Implement a mechanism for "multicast bandwidth monitoring and upcalls".

Submitted by: Pavlin Radoslavov <pavlin@icir.org>


# 7dc7f031 18-Jul-2003 Mike Silbersack <silby@FreeBSD.org>

Minor fix to the MBUF_STRESS_TEST code so that it keeps
pkthdr.len consistant at all times. (Some debugging
code I'm working on is tripped otherwise.)

MFC after: 3 days


# 6e49b1fe 31-May-2003 Garrett Wollman <wollman@FreeBSD.org>

Don't generate an ip_id for packets with the DF bit set; ip_id is
only meaningful for fragments. Also don't bother to byte-swap the
ip_id when we do generate it; it is only used at the receiver as a
nonce. I tried several different permutations of this code with no
measurable difference to each other or to the unmodified version, so
I've settled on the one for which gcc seems to generate the best code.
(If anyone cares to microoptimize this differently for an architecture
where it actually matters, feel free.)

Suggested by: Steve Bellovin's paper in IMW'02


# 4957466b 29-Apr-2003 Matthew N. Dodd <mdodd@FreeBSD.org>

IP_RECVTTL socket option.

Reviewed by: Stuart Cheshire <cheshire@apple.com>


# 53dcc544 12-Apr-2003 Mike Silbersack <silby@FreeBSD.org>

Rename MBUF_FRAG_TEST to MBUF_STRESS_TEST as it will be extended
to include more than just frag tests.


# fe584538 08-Apr-2003 Dag-Erling Smørgrav <des@FreeBSD.org>

Introduce an M_ASSERTPKTHDR() macro which performs the very common task
of asserting that an mbuf has a packet header. Use it instead of hand-
rolled versions wherever applicable.

Submitted by: Hiten Pandya <hiten@unixdaemons.com>


# 212059bd 03-Apr-2003 Dag-Erling Smørgrav <des@FreeBSD.org>

Replace memcpy() and ovbcopy() with bcopy(); ditch some caddr_t usage.


# 2c56e246 02-Apr-2003 Matthew N. Dodd <mdodd@FreeBSD.org>

Back out support for RFC3514.

RFC3514 poses an unacceptale risk to compliant systems.


# 4f6425f7 02-Apr-2003 Matthew N. Dodd <mdodd@FreeBSD.org>

- Use the correct constant define.
- Add a missing break.


# 8faf6df9 02-Apr-2003 Matthew N. Dodd <mdodd@FreeBSD.org>

Sync constant define with NetBSD.

Requested by: Tom Spindler <dogcow@babymeat.com>


# 09139a45 01-Apr-2003 Matthew N. Dodd <mdodd@FreeBSD.org>

Implement support for RFC 3514 (The Security Flag in the IPv4 Header).
(See: ftp://ftp.rfc-editor.org/in-notes/rfc3514.txt)

This fulfills the host requirements for userland support by
way of the setsockopt() IP_EVIL_INTENT message.

There are three sysctl tunables provided to govern system behavior.

net.inet.ip.rfc3514:

Enables support for rfc3514. As this is an
Informational RFC and support is not yet widespread
this option is disabled by default.

net.inet.ip.hear_no_evil

If set the host will discard all received evil packets.

net.inet.ip.speak_no_evil

If set the host will discard all transmitted evil packets.

The IP statistics counter 'ips_evil' (available via 'netstat') provides
information on the number of 'evil' packets recieved.

For reference, the '-E' option to 'ping' has been provided to demonstrate
and test the implementation.


# 511e01e2 25-Mar-2003 Maxime Henrion <mux@FreeBSD.org>

Try to make the MBUF_FRAG_TEST code work better.

- Don't try to fragment the packet if it's smaller than mbuf_frag_size.
- Preserve the size of the mbuf chain which is modified by m_split().
- Check that m_split() didn't return NULL.
- Make it so we don't end up with two M_PKTHDR mbuf in the chain.
- Use m->m_pkthdr.len instead of m->m_len so that we fragment the whole
chain and not just the first mbuf.
- Fix a nearby style bug and rework the logic of the loops so that it's
more clear.

This is still not quite right, because we're clearly abusing m_split() to
do something it was not designed for, but at least it works now. We
should probably move this code into a m_fragment() function when it's
correct.


# 9d9edc56 24-Mar-2003 Mike Silbersack <silby@FreeBSD.org>

Add the MBUF_FRAG_TEST option. When compiled in, this option
allows you to tell ip_output to fragment all outgoing packets
into mbuf fragments of size net.inet.ip.mbuf_frag_size bytes.
This is an excellent way to test if network drivers can properly
handle long mbuf chains being passed to them.

net.inet.ip.mbuf_frag_size defaults to 0 (no fragmentation)
so that you can at least boot before your network driver dies. :)


# 8608c4c1 20-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Remove unused variables in the IPSEC case.

Submitted by: Lars Eggert <larse@ISI.EDU>


# 340c35de 19-Feb-2003 Jonathan Lemon <jlemon@FreeBSD.org>

Add a TCP TIMEWAIT state which uses less space than a fullblown TCP
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.

Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 9359ad86 29-Jan-2003 Sam Leffler <sam@FreeBSD.org>

FAST_IPSEC bandaid: act like KAME and ignore ENOENT error codes from
ipsec4_process_packet; they happen when a packet is dropped because
an SA acquire is initiated

Submitted by: Doug Ambrisko <ambrisko@verniernetworks.com>


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 9d5abbdd 01-Jan-2003 Jens Schweikhardt <schweikh@FreeBSD.org>

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


# b375c9ec 20-Nov-2002 Luigi Rizzo <luigi@FreeBSD.org>

Back out the ip_fragment() code -- it is not urgent to have it in now,
I will put it back in in a better form after 5.0 is out.

Requested by: sam, rwatson, luigi (on second thought)
Approved by: re


# 3e372e14 17-Nov-2002 Luigi Rizzo <luigi@FreeBSD.org>

Move the ip_fragment code from ip_output() to a separate function,
so that it can be reused elsewhere (there is a number of places
where it can be useful). This also trims some 200 lines from
the body of ip_output(), which helps readability a bit.

(This change was discussed a few weeks ago on the mailing lists,
Julian agreed, silence from others. It is not a functional change,
so i expect it to be ok to commit it now but i am happy to back it
out if there are objections).

While at it, fix some function headers and replace m_copy() with
m_copypacket() where applicable.

MFC after: 1 week


# bbb4330b 15-Nov-2002 Luigi Rizzo <luigi@FreeBSD.org>

Massive cleanup of the ip_mroute code.

No functional changes, but:

+ the mrouting module now should behave the same as the compiled-in
version (it did not before, some of the rsvp code was not loaded
properly);
+ netinet/ip_mroute.c is now truly optional;
+ removed some redundant/unused code;
+ changed many instances of '0' to NULL and INADDR_ANY as appropriate;
+ removed several static variables to make the code more SMP-friendly;
+ fixed some minor bugs in the mrouting code (mostly, incorrect return
values from functions).

This commit is also a prerequisite to the addition of support for PIM,
which i would like to put in before DP2 (it does not change any of
the existing APIs, anyways).

Note, in the process we found out that some device drivers fail to
properly handle changes in IFF_ALLMULTI, leading to interesting
behaviour when a multicast router is started. This bug is not
corrected by this commit, and will be fixed with a separate commit.

Detailed changes:
--------------------
netinet/ip_mroute.c all the above.
conf/files make ip_mroute.c optional
net/route.c fix mrt_ioctl hook
netinet/ip_input.c fix ip_mforward hook, move rsvp_input() here
together with other rsvp code, and a couple
of indentation fixes.
netinet/ip_output.c fix ip_mforward and ip_mcast_src hooks
netinet/ip_var.h rsvp function hooks
netinet/raw_ip.c hooks for mrouting and rsvp functions, plus
interface cleanup.
netinet/ip_mroute.h remove an unused and optional field from a struct

Most of the code is from Pavlin Radoslavov and the XORP project

Reviewed by: sam
MFC after: 1 week


# ab94ca3c 08-Nov-2002 Sam Leffler <sam@FreeBSD.org>

correct fast ipsec logic: compare destination ip address against the
contents of the SA, not the SP

Submitted by: "Doug Ambrisko" <ambrisko@verniernetworks.com>


# 53be11f6 20-Oct-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Fix two instances of variant struct definitions in sys/netinet:

Remove the never completed _IP_VHL version, it has not caught on
anywhere and it would make us incompatible with other BSD netstacks
to retain this version.

Add a CTASSERT protecting sizeof(struct ip) == 20.

Don't let the size of struct ipq depend on the IPDIVERT option.

This is a functional no-op commit.

Approved by: re


# b9234faf 15-Oct-2002 Sam Leffler <sam@FreeBSD.org>

Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks


# 5d846453 15-Oct-2002 Sam Leffler <sam@FreeBSD.org>

Replace aux mbufs with packet tags:

o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month


# cb7641e8 23-Sep-2002 Maxim Konovalov <maxim@FreeBSD.org>

Slightly rearrange a code in rev. 1.164:

o Move len initialization closer to place of its first usage.
o Compare len with 0 to improve readability.
o Explicitly zero out phlen in ip_insertoptions() in failure case.

Suggested by: jhb
Reviewed by: jhb
MFC after: 2 weeks


# e079ba8d 17-Sep-2002 Maxim Konovalov <maxim@FreeBSD.org>

In rare cases when there is no room for ip options ip_insertoptions()
can fail and corrupt a header length. Initialize len and check what
ip_insertoptions() returns.

Reviewed by: archie, silence on -net
MFC after: 5 days


# 4ed84624 31-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

When fragmenting an IP datagram, invoke an appropriate MAC entry
point so that MAC labels may be copied (...) to the individual
IP fragment mbufs by MAC policies.

When IP options are inserted into an IP datagram when leaving a
host, preserve the label if we need to reallocate the mbuf for
alignment or size reasons.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 3956b023 12-Jul-2002 Luigi Rizzo <luigi@FreeBSD.org>

Avoid dereferencing a null pointer in ro_rt.

This was always broken in HEAD (the offending statement was introduced
in rev. 1.123 for HEAD, while RELENG_4 included this fix (in rev.
1.99.2.12 for RELENG_4) and I inadvertently deleted it in 1.99.2.30.

So I am also restoring these two lines in RELENG_4 now.
We might need another few things from 1.99.2.30.


# 7627c6cb 27-Jun-2002 Maxime Henrion <mux@FreeBSD.org>

Warning fixes for 64 bits platforms. With this last fix,
I can build a GENERIC sparc64 kernel with -Werror.

Reviewed by: luigi


# 98cb733c 25-Jun-2002 Kenneth D. Merry <ken@FreeBSD.org>

At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.

NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in
sys/tiio.h.

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c:Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault
routines.

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.


# 51aed12e 23-Jun-2002 Luigi Rizzo <luigi@FreeBSD.org>

fix bad indentation and whitespace resulting from cut&paste


# 2b25acc1 22-Jun-2002 Luigi Rizzo <luigi@FreeBSD.org>

Remove (almost all) global variables that were used to hold
packet forwarding state ("annotations") during ip processing.
The code is considerably cleaner now.

The variables removed by this change are:

ip_divert_cookie used by divert sockets
ip_fw_fwd_addr used for transparent ip redirection
last_pkt used by dynamic pipes in dummynet

Removal of the first two has been done by carrying the annotations
into volatile structs prepended to the mbuf chains, and adding
appropriate code to add/remove annotations in the routines which
make use of them, i.e. ip_input(), ip_output(), tcp_input(),
bdg_forward(), ether_demux(), ether_output_frame(), div_output().

On passing, remove a bug in divert handling of fragmented packet.
Now it is the fragment at offset 0 which sets the divert status of
the whole packet, whereas formerly it was the last incoming fragment
to decide.

Removal of last_pkt required a change in the interface of ip_fw_chk()
and dummynet_io(). On passing, use the same mechanism for dummynet
annotations and for divert/forward annotations.

option IPFIREWALL_FORWARD is effectively useless, the code to
implement it is very small and is now in by default to avoid the
obfuscation of conditionally compiled code.

NOTES:
* there is at least one global variable left, sro_fwd, in ip_output().
I am not sure if/how this can be removed.

* I have deliberately avoided gratuitous style changes in this commit
to avoid cluttering the diffs. Minor stule cleanup will likely be
necessary

* this commit only focused on the IP layer. I am sure there is a
number of global variables used in the TCP and maybe UDP stack.

* despite the number of files touched, there are absolutely no API's
or data structures changed by this commit (except the interfaces of
ip_fw_chk() and dummynet_io(), which are internal anyways), so
an MFC is quite safe and unintrusive (and desirable, given the
improved readability of the code).

MFC after: 10 days


# db40007d 21-May-2002 Andrew R. Reiter <arr@FreeBSD.org>

- Change the newly turned INVARIANTS #ifdef blocks (they were changed from
DIAGNOSTIC yesterday) into KASSERT()'s as these help to increase code
readability.


# 4cb674c9 20-May-2002 Andrew R. Reiter <arr@FreeBSD.org>

- Turn a few DIAGNOSTIC into INVARIANTS since they are really sanity
checks.


# d60315be 09-May-2002 Luigi Rizzo <luigi@FreeBSD.org>

Cleanup the interface to ip_fw_chk, two of the input arguments
were totally useless and have been removed.

ip_input.c, ip_output.c:
Properly initialize the "ip" pointer in case the firewall does an
m_pullup() on the packet.

Remove some debugging code forgotten long ago.

ip_fw.[ch], bridge.c:
Prepare the grounds for matching MAC header fields in bridged packets,
so we can have 'etherfw' functionality without a lot of kernel and
userland bloat.


# 44731cab 01-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# e3f406b3 22-Mar-2002 Ruslan Ermilov <ru@FreeBSD.org>

Prevent icmp_reflect() from calling ip_output() with a NULL route
pointer which will then result in the allocated route's reference
count never being decremented. Just flood ping the localhost and
watch refcnt of the 127.0.0.1 route with netstat(1).

Submitted by: jayanth

Back out ip_output.c,v 1.143 and ip_mroute.c,v 1.69 that allowed
ip_output() to be called with a NULL route pointer. The previous
paragraph shows why this was a bad idea in the first place.

MFC after: 0 days


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# fd8e4ebc 18-Feb-2002 Mike Barcroft <mike@FreeBSD.org>

o Move NTOHL() and associated macros into <sys/param.h>. These are
deprecated in favor of the POSIX-defined lowercase variants.
o Change all occurrences of NTOHL() and associated marcros in the
source tree to use the lowercase function variants.
o Add missing license bits to sparc64's <machine/endian.h>.
Approved by: jake
o Clean up <machine/endian.h> files.
o Remove unused __uint16_swap_uint32() from i386's <machine/endian.h>.
o Remove prototypes for non-existent bswapXX() functions.
o Include <machine/endian.h> in <arpa/inet.h> to define the
POSIX-required ntohl() family of functions.
o Do similar things to expose the ntohl() family in libstand, <netinet/in.h>,
and <sys/param.h>.
o Prepend underscores to the ntohl() family to help deal with
complexities associated with having MD (asm and inline) versions, and
having to prevent exposure of these functions in other headers that
happen to make use of endian-specific defines.
o Create weak aliases to the canonical function name to help deal with
third-party software forgetting to include an appropriate header.
o Remove some now unneeded pollution from <sys/types.h>.
o Add missing <arpa/inet.h> includes in userland.

Tested on: alpha, i386
Reviewed by: bde, jake, tmm


# 51c8ec4a 14-Feb-2002 Ruslan Ermilov <ru@FreeBSD.org>

Moved the 127/8 check below so that IPF redirects have a chance of working.

MFC after: 1 day


# a4a6e773 21-Jan-2002 Hajimu UMEMOTO <ume@FreeBSD.org>

- Check the address family of the destination cached in a PCB.
- Clear the cached destination before getting another cached route.
Otherwise, garbage in the padding space (which might be filled in if it was
used for IPv4) could annoy rtalloc.

Obtained from: KAME


# 8c3f5566 21-Jan-2002 Ruslan Ermilov <ru@FreeBSD.org>

RFC1122 requires that addresses of the form { 127, <any> } MUST NOT
appear outside a host.

PR: 30792, 33996
Obtained from: ip_input.c
MFC after: 1 week


# 92bdb2fa 05-Jan-2002 Bill Fenner <fenner@FreeBSD.org>

Pre-calculate the checksum for multicast packets sourced on a
multicast router. This is overkill; it should be possible to
delay to hardware interfaces and only pre-calculate when forwarding
to a tunnel.


# 3efc3014 28-Dec-2001 Julian Elischer <julian@FreeBSD.org>

Fix ipfw fwd so that it acts as the docs say
when forwarding an incoming packet to another machine.

Obtained from: Vicor Production tree
MFC after: 3 weeks


# 3f9e3122 19-Dec-2001 Yaroslav Tykhiy <ytykhiy@gmail.com>

Don't try to free a NULL route when doing IPFIREWALL_FORWARD.
An old route will be NULL at that point if a packet were initially
routed to an interface (using the IP_ROUTETOIF flag.)

Submitted by: Igor Timkin <ivt@gamma.ru>


# aa1f5daa 14-Dec-2001 Jonathan Lemon <jlemon@FreeBSD.org>

whitespace and style fixes recovered from -stable.


# 04d59553 01-Dec-2001 Ruslan Ermilov <ru@FreeBSD.org>

Allow for ip_output() to be called with a NULL route pointer.
This fixes a panic I introduced yesterday in ip_icmp.c,v 1.64.


# 7b109fa4 04-Nov-2001 Luigi Rizzo <luigi@FreeBSD.org>

MFS: sync the ipfw/dummynet/bridge code with the one recently merged
into stable (mostly , but not only, formatting and comments changes).


# 3528d68f 30-Oct-2001 Bill Paul <wpaul@FreeBSD.org>

Fix a (long standing?) bug in ip_output(): if ip_insertoptions() is
called and ip_output() encounters an error and bails (i.e. host
unreachable), we will leak an mbuf. This is because the code calls
m_freem(m0) after jumping to the bad: label at the end of the function,
when it should be calling m_freem(m). (m0 is the original mbuf list
_without_ the options mbuf prepended.)

Obtained from: NetBSD


# 35609d45 30-Oct-2001 Jonathan Lemon <jlemon@FreeBSD.org>

When dropping a packet because there is no room in the queue (which itself
is somewhat bogus), update the statistics to indicate something was dropped.

PR: 13740


# db69a05d 04-Oct-2001 Paul Saab <ps@FreeBSD.org>

Make it so dummynet and bridge can be loaded as modules.

Submitted by: billf


# ca925d9c 28-Sep-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Add a hash table that contains the list of internet addresses, and use
this in place of the in_ifaddr list when appropriate. This improves
performance on hosts which have a large number of IP aliases.


# 9a10980e 28-Sep-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Centralize satosin(), sintosa() and ifatoia() macros in <netinet/in.h>
Remove local definitions.


# 830cc178 27-Sep-2001 Luigi Rizzo <luigi@FreeBSD.org>

Two main changes here:
+ implement "limit" rules, which permit to limit the number of sessions
between certain host pairs (according to masks). These are a special
type of stateful rules, which might be of interest in some cases.
See the ipfw manpage for details.

+ merge the list pointers and ipfw rule descriptors in the kernel, so
the code is smaller, faster and more readable. This patch basically
consists in replacing "foo->rule->bar" with "rule->bar" all over
the place.
I have been willing to do this for ages!

MFC after: 1 week


# 9494d596 25-Sep-2001 Brooks Davis <brooks@FreeBSD.org>

Make faith loadable, unloadable, and clonable.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# f9132ceb 05-Sep-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Wrap array accesses in macros, which also happen to be lvalues:

ifnet_addrs[i - 1] -> ifaddr_byindex(i)
ifindex2ifnet[i] -> ifnet_byindex(i)

This is intended to ease the conversion to SMPng.


# 07203494 03-Aug-2001 Daniel C. Sobral <dcs@FreeBSD.org>

MFS: Avoid dropping fragments in the absence of an interface address.

Noticed by: fenner
Submitted by: iedowse
Not committed to current by: iedowse ;-)


# 38c1bc35 23-Jul-2001 Ruslan Ermilov <ru@FreeBSD.org>

Avoid a NULL pointer derefence introduced in rev. 1.129.

Problem noticed by: bde, gcc(1)
Panic caught by: mjacob
Patch tested by: mjacob


# f2c2962e 19-Jul-2001 Ruslan Ermilov <ru@FreeBSD.org>

Backout non-functional changes from revision 1.128.

Not objected to by: dcs


# 3afefa39 17-Jul-2001 Daniel C. Sobral <dcs@FreeBSD.org>

Skip the route checking in the case of multicast packets with known
interfaces.

Reviewed by: people at that channel
Approved by: silence on -net


# 33841545 10-Jun-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


# 64dddc18 01-Jun-2001 Kris Kennaway <kris@FreeBSD.org>

Add ``options RANDOM_IP_ID'' which randomizes the ID field of IP packets.
This closes a minor information leak which allows a remote observer to
determine the rate at which the machine is generating packets, since the
default behaviour is to increment a counter for each packet sent.

Reviewed by: -net
Obtained from: OpenBSD


# 206a3274 13-Mar-2001 Ruslan Ermilov <ru@FreeBSD.org>

RFC768 (UDP) requires that "if the computed checksum is zero, it
is transmitted as all ones". This got broken after introduction
of delayed checksums as follows. Some guys (including Jonathan)
think that it is allowed to transmit all ones in place of a zero
checksum for TCP the same way as for UDP. (The discussion still
takes place on -net.) Thus, the 0 -> 0xffff checksum fixup was
first moved from udp_output() (see udp_usrreq.c, 1.64 -> 1.65)
to in_cksum_skip() (see sys/i386/i386/in_cksum.c, 1.17 -> 1.18,
INVERT expression). Besides that I disagree that it is valid for
TCP, there was no real problem until in_cksum.c,v 1.20, where the
in_cksum() was made just a special version of in_cksum_skip().
The side effect was that now every incoming IP datagram failed to
pass the checksum test (in_cksum() returned 0xffff when it should
actually return zero). It was fixed next day in revision 1.21,
by removing the INVERT expression. The latter also broke the
0 -> 0xffff fixup for UDP checksums.

Before this change:
: tcpdump: listening on lo0
: 127.0.0.1.33005 > 127.0.0.1.33006: udp 0 (ttl 64, id 1)
: 4500 001c 0001 0000 4011 7cce 7f00 0001
: 7f00 0001 80ed 80ee 0008 0000

After this change:
: tcpdump: listening on lo0
: 127.0.0.1.33005 > 127.0.0.1.33006: udp 0 (ttl 64, id 1)
: 4500 001c 0001 0000 4011 7cce 7f00 0001
: 7f00 0001 80ed 80ee 0008 ffff


# 5d936aa1 11-Mar-2001 Ian Dowse <iedowse@FreeBSD.org>

In ip_output(), initialise `ia' in the case where the packet has
come from a dummynet pipe. Without this, the code which increments
the per-ifaddr stats can dereference an uninitialised pointer. This
should make dummynet usable again.

Reported by: "Dmitry A. Yanko" <fm@astral.ntu-kpi.kiev.ua>
Reviewed by: luigi, joe


# 05f15c3d 26-Feb-2001 Jeroen Ruigrok van der Werven <asmodai@FreeBSD.org>

Remove conditionals for vax support.
People who care much about this are welcomed to try 2.11BSD. :)

Noticed by: luigi
Reviewed by: jesper


# 37d40066 04-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Another round of the <sys/queue.h> FOREACH transmogriffer.

Created with: sed(1)
Reviewed by: md5(1)


# fc2ffbe6 04-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# 507b4b54 01-Feb-2001 Luigi Rizzo <luigi@FreeBSD.org>

MFS: bridge/ipfw/dummynet fixes (bridge.c will be committed separately)


# 7a726a2d 24-Jan-2001 Luigi Rizzo <luigi@FreeBSD.org>

Pass up errors returned by dummynet. The same should be done with
divert.


# 2a0c503e 21-Dec-2000 Bosko Milekic <bmilekic@FreeBSD.org>

* Rename M_WAIT mbuf subsystem flag to M_TRYWAIT.
This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that it's necessary to check whether
calls with M_(TRY)WAIT also resulted in a failed allocation.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.

* Fix a typo in a comment in mbuf.h

* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
value of the M_WAIT flag, this could have became a big problem.


# ffa37b3f 31-Oct-2000 Josef Karthauser <joe@FreeBSD.org>

It's no longer true that "nobody uses ia beyond here"; it's now
used to keep address based if_data statistics in.

Submitted by: ru


# cf9fa8e7 29-Oct-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move suser() and suser_xxx() prototypes and a related #define from
<sys/proc.h> to <sys/systm.h>.

Correctly document the #includes needed in the manpage.

Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.


# fe937674 28-Oct-2000 Josef Karthauser <joe@FreeBSD.org>

Count per-address statistics for IP fragments.

Requested by: ru
Obtained from: BSD/OS


# cc22c7a7 20-Oct-2000 Ruslan Ermilov <ru@FreeBSD.org>

Save a few CPU cycles in IP fragmentation code.


# 5da9f8fa 19-Oct-2000 Josef Karthauser <joe@FreeBSD.org>

Augment the 'ifaddr' structure with a 'struct if_data' to keep
statistics on a per network address basis.

Teach the IPv4 and IPv6 input/output routines to log packets/bytes
against the network address connected to the flow.

Teach netstat to display the per-address stats for IP protocols
when 'netstat -i' is evoked, instead of displaying the per-interface
stats.


# e30177e0 14-Sep-2000 Ruslan Ermilov <ru@FreeBSD.org>

Follow BSD/OS and NetBSD, keep the ip_id field in network order all the time.

Requested by: wollman


# 04287599 31-Aug-2000 Ruslan Ermilov <ru@FreeBSD.org>

Fixed broken ICMP error generation, unified conversion of IP header
fields between host and network byte order. The details:

o icmp_error() now does not add IP header length. This fixes the problem
when icmp_error() is called from ip_forward(). In this case the ip_len
of the original IP datagram returned with ICMP error was wrong.

o icmp_error() expects all three fields, ip_len, ip_id and ip_off in host
byte order, so DTRT and convert these fields back to network byte order
before sending a message. This fixes the problem described in PR 16240
and PR 20877 (ip_id field was returned in host byte order).

o ip_ttl decrement operation in ip_forward() was moved down to make sure
that it does not corrupt the copy of original IP datagram passed later
to icmp_error().

o A copy of original IP datagram in ip_forward() was made a read-write,
independent copy. This fixes the problem I first reported to Garrett
Wollman and Bill Fenner and later put in audit trail of PR 16240:
ip_output() (not always) converts fields of original datagram to network
byte order, but because copy (mcopy) and its original (m) most likely
share the same mbuf cluster, ip_output()'s manipulations on original
also corrupted the copy.

o ip_output() now expects all three fields, ip_len, ip_off and (what is
significant) ip_id in host byte order. It was a headache for years that
ip_id was handled differently. The only compatibility issue here is the
raw IP socket interface with IP_HDRINCL socket option set and a non-zero
ip_id field, but ip.4 manual page was unclear on whether in this case
ip_id field should be in host or network byte order.


# c4ac87ea 31-Jul-2000 Darren Reed <darrenr@FreeBSD.org>

activate pfil_hooks and covert ipfilter to use it


# 686cdd19 04-Jul-2000 Jun-ichiro itojun Hagino <itojun@FreeBSD.org>

sync with kame tree as of july00. tons of bug fixes/improvements.

API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
(also syntax change)


# 707d00a3 02-Jun-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Add boundary checks against IP options.

Obtained from: OpenBSD


# 50c6dc99 24-May-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Mark the checksum as complete when looping back multicast packets.

Submitted by: Jeff Gibbons <jgibbons@n2.net>


# 06a429a3 24-May-2000 Archie Cobbs <archie@FreeBSD.org>

Just need to pass the address family to if_simloop(), not the whole sockaddr.


# 1c238475 21-May-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Compute the checksum before handing the packet off to IPFilter.

Tested by: Cy Schubert <Cy.Schubert@uumail.gov.bc.ca>


# 7cba257a 02-Apr-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Move htons() ip_len to after the in_delayed_cksum() call.
This should stop cksum error messages on IPsec communication
which was reported on freebsd-current.

Reviewed by: jlemon


# ea53ecd9 01-Apr-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Calculate any delayed checksums before handing an mbuf off to a
divert socket. This fixes a problem with ppp/natd.

Reviewed by: bsd (Brian Dean, gotta love that login name)


# 20c822f3 29-Mar-2000 Jonathan Lemon <jlemon@FreeBSD.org>

If `ipfw fwd' loops an mbuf back to ip_input from ip_output and the
mbuf is marked for delayed checksums, then additionally mark the
packet as having it's checksums computed. This allows us to bypass
computing/checking the checksum entirely, which isn't really needeed
as the packet has never hit the wire.

Reviewed by: green


# db4f9cc7 27-Mar-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Add support for offloading IP/TCP/UDP checksums to NIC hardware which
supports them.


# f63e7634 09-Mar-2000 Yoshinobu Inoue <shin@FreeBSD.org>

Initialize mbuf pointer at getting ipsec policy.
Without this, kernel will panic at getsockopt() of IPSEC_POLICY.
Also make compilable libipsec/test-policy.c which tries getsockopt() of
IPSEC_POLICY.

Approved by: jkh

Submitted by: sakane@kame.net


# 6d37c73e 23-Feb-2000 Guido van Rooij <guido@FreeBSD.org>

Remove option IPFILTER_KLD. In case you wanted to kldload ipfilter,
the module would only work in kernels built with this option.

Approved by: jkh


# 6bc748b0 10-Feb-2000 Luigi Rizzo <luigi@FreeBSD.org>

Support the net.inet.ip.fw.enable variable, part of
the recent ipfw modifications.

Approved-by: jordan


# 5db1e34e 10-Jan-2000 Ruslan Ermilov <ru@FreeBSD.org>

MGETHDR() does not initialize m_pkthdr.rcvif, do it here.

This fixes page fault panic observed when diverting packets
with IP options (e.g. ping -R remoteIP over natd).

PR: kern/8596, kern/11199


# d0a98d79 08-Jan-2000 Yoshinobu Inoue <shin@FreeBSD.org>

enable IPsec over DUMMYNET again

Submitted by: luigi
Reviewed by: luigi


# d1f04b29 08-Jan-2000 Luigi Rizzo <luigi@FreeBSD.org>

Cleanup dummynet call interface so it should now work on the Alpha
as well. Also (probably) fix a bug introduced during the IPv6 import.


# 6a800098 22-Dec-1999 Yoshinobu Inoue <shin@FreeBSD.org>

IPSEC support in the kernel.
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


# 8948e4ba 05-Dec-1999 Archie Cobbs <archie@FreeBSD.org>

Miscellaneous fixes/cleanups relating to ipfw and divert(4):

- Implement 'ipfw tee' (finally)
- Divert packets by calling new function divert_packet() directly instead
of going through protosw[].
- Replace kludgey global variable 'ip_divert_port' with a function parameter
to divert_packet()
- Replace kludgey global variable 'frag_divert_port' with a function parameter
to ip_reass()
- style(9) fixes

Reviewed by: julian, green


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# eaa726be 04-May-1999 Luigi Rizzo <luigi@FreeBSD.org>

Free the dummynet descriptor in ip_dummynet, not in the called
routines. The descriptor contains parameters which could be used
within those routines (eg. ip_output() ).

On passing, add IPPROTO_PGM entry to netinet/in.h


# a7c21949 04-May-1999 Luigi Rizzo <luigi@FreeBSD.org>

forgot passing the right pointer to dst to dummynet_io().
(-stable and releng2 were already safe).
Debugged-By: phk


# 66e55756 20-Apr-1999 Peter Wemm <peter@FreeBSD.org>

Tidy up some stray / unused stuff in the IPFW package and friends.
- unifdef -DCOMPAT_IPFW (this was on by default already)
- remove traces of in-kernel ip_nat package, it was never committed.
- Make IPFW and DUMMYNET initialize themselves rather than depend on
compiled-in hooks in ip_init(). This means they initialize the same
way both in-kernel and as kld modules. (IPFW initializes now :-)


# f0a53591 15-Mar-1999 Luigi Rizzo <luigi@FreeBSD.org>

Fix a dummynet bug caused by passing a bad next hop address (the
symptom was the msg "arp failure -- host is not on local network" that
some user have seen on multihomed machines.
Bug tracked down by Emmanuel Duros


# 17458d35 19-Feb-1999 Luigi Rizzo <luigi@FreeBSD.org>

avoid panic with pkts larger than MTU and DF set coming out of a pipe.


# f0f6d643 21-Dec-1998 Luigi Rizzo <luigi@FreeBSD.org>

Restore 1.82->1.83 change deleted by mistake< per Bruce suggestion


# b715f178 14-Dec-1998 Luigi Rizzo <luigi@FreeBSD.org>

Last bits (i think) of dummynet for -current.


# 1c5bb3ea 10-Nov-1998 Peter Wemm <peter@FreeBSD.org>

add #include <sys/kernel.h> where it's needed by MALLOC_DEFINE()


# db028362 02-Sep-1998 Garrett Wollman <wollman@FreeBSD.org>

Properly fragment multicast packets.

PR: 7802
Submitted by: Steve McCanne <mccanne@cs.berkeley.edu>


# cfe8b629 22-Aug-1998 Garrett Wollman <wollman@FreeBSD.org>

Yow! Completely change the way socket options are handled, eliminating
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.


# 9de9737f 01-Aug-1998 Peter Wemm <peter@FreeBSD.org>

Fix a compile error if IPFIREWALL_FORWARD active without IPDIVERT.


# 0c8d2590 12-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Fixed some longs that should have been fixed-sized types.


# 1f7e052c 05-Jul-1998 Julian Elischer <julian@FreeBSD.org>

Don't expect the new code to be used without the right option file being
included.


# d4295c32 05-Jul-1998 Julian Elischer <julian@FreeBSD.org>

Fix braino in switching to TAILQ macro.


# f9e354df 05-Jul-1998 Julian Elischer <julian@FreeBSD.org>

Support for IPFW based transparent forwarding.
Any packet that can be matched by a ipfw rule can be redirected
transparently to another port or machine. Redirection to another port
mostly makes sense with tcp, where a session can be set up
between a proxy and an unsuspecting client. Redirection to another machine
requires that the other machine also be expecting to receive the forwarded
packets, as their headers will not have been modified.

/sbin/ipfw must be recompiled!!!

Reviewed by: Peter Wemm <peter@freebsd.org>
Submitted by: Chrisy Luke <chrisy@flix.net>


# e5b19842 21-Jun-1998 Bruce Evans <bde@FreeBSD.org>

Removed unused includes.


# 2b8a366c 14-Jun-1998 Julian Elischer <julian@FreeBSD.org>

fix another typo


# 201c2527 14-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Try narrow down the culprit sending undefined packet types through the loopback


# ed7509ac 11-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Go through the loopback code with a broom..
Remove lots'o'hacks.
looutput is now static.

Other callers who want to use loopback to allow shortcutting
should call the special entrypoint for this, if_simloop(), which is
specifically designed for this purpose. Using looutput for this purpose
was problematic, particularly with bpf and trying to keep track
of whether one should be using the charateristics of the loopback interface
or the interface (e.g. if_ethersubr.c) that was requesting the loopback.
There was a whole class of errors due to this mis-use each of which had
hacks to cover them up.

Consists largly of hack removal :-)


# b8760493 06-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Make sure the default value of a dummy variable is 0
so that it doesn't do anything.


# 3ed81d03 06-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Fix wrong data type for a pointer.


# c977d4c7 06-Jun-1998 Julian Elischer <julian@FreeBSD.org>

clean up the changes made to ipfw over the last weeks
(should make the ipfw lkm work again)


# e256a933 05-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Reverse the default sense of the IPFW/DIVERT reinjection code
so that the new behaviour is now default.
Solves the "infinite loop in diversion" problem when more than one diversion
is active.
Man page changes follow.

The new code is in -stable as the NON default option.


# bb60f459 25-May-1998 Julian Elischer <julian@FreeBSD.org>

Add optional code to change the way that divert and ipfw work together.
Prior to this change, Accidental recursion protection was done by
the diverted daemon feeding back the divert port number it got
the packet on, as the port number on a sendto(). IPFW knew not to
redivert a packet to this port (again). Processing of the ruleset
started at the beginning again, skipping that divert port.

The new semantic (which is how we should have done it the first time)
is that the port number in the sendto() is the rule number AFTER which
processing should restart, and on a recvfrom(), the port number is the
rule number which caused the diversion. This is much more flexible,
and also more intuitive. If the user uses the same sockaddr received
when resending, processing resumes at the rule number following that
that caused the diversion. The user can however select to resume rule
processing at any rule. (0 is restart at the beginning)

To enable the new code use

option IPFW_DIVERT_RESTART

This should become the default as soon as people have looked at it a bit


# 1ee25934 21-Mar-1998 Peter Wemm <peter@FreeBSD.org>

Make this compile.. There are some unpleasing hacks in here.
A major unifdef session is sorely tempting but would destroy any remaining
chance of tracking the original sources.


# d68fa50c 20-Feb-1998 Bruce Evans <bde@FreeBSD.org>

Don't depend on "implicit int".


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# 0abc78a6 07-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Rename some local variables to avoid shadowing other local variables.

Found by: -Wshadow


# fbd1372a 05-Nov-1997 Joerg Wunsch <joerg@FreeBSD.org>

Make IPDIVERT a supported option. Alas, in_var.h depends on it, i
hope i've found out all files that actually depend on this dependancy.
IMHO, it's not very good practice to change the size of internal
structs depending on kernel options.


# a1c995b6 12-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 55166637 11-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Distribute and statizice a lot of the malloc M_* types.

Substantial input from: bde


# 1fd0b058 02-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# e4676ba6 01-Jun-1997 Julian Elischer <julian@FreeBSD.org>

Submitted by: Whistle Communications (archie Cobbs)

these are quite extensive additions to the ipfw code.
they include a change to the API because the old method was
broken, but the user view is kept the same.

The new code allows a particular match to skip forward to a particular
line number, so that blocks of rules can be
used without checking all the intervening rules.
There are also many more ways of rejecting
connections especially TCP related, and
many many more ...

see the man page for a complete description.


# 86b1d6d2 06-May-1997 Bill Fenner <fenner@FreeBSD.org>

Pull up the IP header in ip_mloopback(). This makes sure that the
operations on the header inside ip_mloopback() are performed on
a private copy instead of a shared cluster.

PR: kern/3410


# a29f300e 27-Apr-1997 Garrett Wollman <wollman@FreeBSD.org>

The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


# beec8214 03-Apr-1997 Darren Reed <darrenr@FreeBSD.org>

Resolve conflicts created by import.


# ca98b82c 02-Apr-1997 David Greenman <dg@FreeBSD.org>

Reorganize elements of the inpcb struct to take better advantage of
cache lines. Removed the struct ip proto since only a couple of chars
were actually being used in it. Changed the order of compares in the
PCB hash lookup to take advantage of partial cache line fills (on PPro).

Discussed-with: wollman


# e1596dff 28-Feb-1997 Bill Fenner <fenner@FreeBSD.org>

Fix a comment and some commented-out code in ip_mloopback to
reflect how multicast loopback really works.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# f1743588 19-Feb-1997 Darren Reed <darrenr@FreeBSD.org>

change IP Filter hooks to match new 3.1.8 patches for FreeBSD


# afed1b49 10-Feb-1997 Darren Reed <darrenr@FreeBSD.org>

Add IP Filter hooks (from patches).


# d81e4043 02-Feb-1997 Brian Somers <brian@FreeBSD.org>

Reset ip_divert_ignore to zero immediately after use - also,
set it in the first place, independent of whether sin->sin_port
is set.

The result is that diverted packets that are being forwarded
will be diverted once and only once on the way in (ip_input())
and again, once and only once on the way out (ip_output()) -
twice in total. ICMP packets that don't contain a port will
now also be diverted.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 59562606 13-Dec-1996 Garrett Wollman <wollman@FreeBSD.org>

Convert the interface address and IP interface address structures
to TAILQs. Fix places which referenced these for no good reason
that I can see (the references remain, but were fixed to compile
again; they are still questionable).


# 82c23eba 10-Nov-1996 Bill Fenner <fenner@FreeBSD.org>

Add the IP_RECVIF socket option, which supplies a packet's incoming interface
using a sockaddr_dl.

Fix the other packet-information socket options (SO_TIMESTAMP, IP_RECVDSTADDR)
to work for multicast UDP and raw sockets as well. (They previously only
worked for unicast UDP).


# 6713d4a7 22-Oct-1996 Søren Schmidt <sos@FreeBSD.org>

Changed args to the nat functions.


# 58938916 07-Oct-1996 Garrett Wollman <wollman@FreeBSD.org>

All three files: make COMPAT_IPFW==0 case work again.
ip_input.c:
- delete some dusty code
- _IP_VHL
- use fast inline header checksum when possible


# fed1c7e9 21-Aug-1996 Søren Schmidt <sos@FreeBSD.org>

Add hooks for an IP NAT module, much like the firewall stuff...
Move the sockopt definitions for the firewall code from
ip_fw.h to in.h where it belongs.


# 93e0e116 10-Jul-1996 Julian Elischer <julian@FreeBSD.org>

Adding changes to ipfw and the kernel to support ip packet diversion..
This stuff should not be too destructive if the IPDIVERT is not compiled in..
be aware that this changes the size of the ip_fw struct
so ipfw needs to be recompiled to use it.. more changes coming to clean this up.


# 0453d3cb 08-Jun-1996 Bruce Evans <bde@FreeBSD.org>

Changed some memcpy()'s back to bcopy()'s.

gcc only inlines memcpy()'s whose count is constant and didn't inline
these. I want memcpy() in the kernel go away so that it's obvious that
it doesn't need to be optimized. Now it is only used for one struct
copy in si.c.


# f9493383 22-May-1996 Garrett Wollman <wollman@FreeBSD.org>

Conditionalize calls to IPFW code on COMPAT_IPFW. This is done slightly
unconventionally:
If COMPAT_IPFW is not defined, or if it is defined to 1, enable;
otherwise, disable.

This means that these changes actually have no effect on anyone at the
moment. (It just makes it easier for me to keep my code in sync.)
In the future, the `not defined' part of the hack should be eliminated,
but doing this now would require everyone to change their config files.

The same conditionals need to be made in ip_input.c as well for this to
ave any useful effect, but I'm not ready to do that right now.


# ce8c72b1 21-May-1996 Peter Wemm <peter@FreeBSD.org>

Fix an embarresing error on my part that made the IP_PORTRANGE options
return a failure code (even though it worked).
This commit brought to you by the 'C' keyword "break".. :-)


# 9f9b3dc4 06-May-1996 Garrett Wollman <wollman@FreeBSD.org>

Add three new route flags to help determine what sort of address
the destination represents. For IP:

- Iff it is a host route, RTF_LOCAL and RTF_BROADCAST indicate local
(belongs to this host) and broadcast addresses, respectively.

- For all routes, RTF_MULTICAST is set if the destination is multicast.

The RTF_BROADCAST flag is used by ip_output() to eliminate a call to
in_broadcast() in a common case; this gives about 1% in our packet-generation
experiments. All three flags might be used (although they aren't now)
to determine whether a packet can be forwarded; a given host route can
represent a forwardable address if:

(rt->rt_flags & (RTF_HOST | RTF_LOCAL | RTF_BROADCAST | RTF_MULTICAST))
== RTF_HOST

Obviously, one still has to do all the work if a host route is not present,
but this code allows one to cache the results of such a lookup if rtalloc1()
is called without masking RTF_PRCLONING.


# e2184122 21-Apr-1996 Bruce Evans <bde@FreeBSD.org>

Fixed in-line IP header checksumming. It was performed on the wrong header
in one case.


# 9c9137ea 18-Apr-1996 Garrett Wollman <wollman@FreeBSD.org>

Three speed-ups in the output path (two small, one substantial):

1) Require all callers to pass a valid route pointer to ip_output()
so that we don't have to check and allocate one off the stack
as was done before. This eliminates one test and some stack
bloat from the common (UDP and TCP) case.

2) Perform the IP header checksum in-line if it's of the usual length.
This results in about a 5% speed-up in my packet-generation test.

3) Use ip_vhl field rather than ip_v and ip_hl bitfields.


# 23bf9953 03-Apr-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Add feature for tcp "established".
Change interface between netinet and ip_fw to be more general, and thus
hopefully also support other ip filtering implementations.


# fbc6ab00 26-Mar-1996 Bill Fenner <fenner@FreeBSD.org>

Add missing splx(s) in IP_MULTICAST_IF

Submitted by: Jim Binkley <jrb@cs.pdx.edu>


# 072b9b24 13-Mar-1996 Paul Traina <pst@FreeBSD.org>

Fix ip option processing for raw IP sockets. This whole thing is a compromise
between ignoring options specified in the setsockopt call if IP_HDRINCL is set
(the UCB choice when VJ's code was brought in) vs allowing them (what everyone
else did, and what is assumed by programs everywhere...sigh).

Also perform some checking of the passed down packet to avoid running off
the end of a mbuf chain.

Reviewed by: fenner


# 2ee45d7d 11-Mar-1996 David Greenman <dg@FreeBSD.org>

Move or add #include <queue.h> in preparation for upcoming struct socket
changes.


# b83e4314 23-Feb-1996 Poul-Henning Kamp <phk@FreeBSD.org>

The new firewall functionality:
Filter on the direction (in/out).
Filter on fragment/not fragment.


# e7319bab 23-Feb-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Big sweep over the IPFIREWALL and IPACCT code.

Close the ip-fragment hole.
Waste less memory.
Rewrite to contemporary more readable style.
Kill separate IPACCT facility, use "accept" rules in IPFIREWALL.
Filter incoming >and< outgoing packets.
Replace "policy" by sticky "deny all" rule.
Rules have numbers used for ordering and deletion.
Remove "rerorder" code entirely.
Count packet & bytecount matches for rules.

Code in -current & -stable is now the same.


# 33b3ac06 22-Feb-1996 Peter Wemm <peter@FreeBSD.org>

Make the default behavior of local port assignment match traditional
systems (my last change did not mix well with some firewall
configurations). As much as I dislike firewalls, this is one thing I
I was not prepared to break by default.. :-)

Allow the user to nominate one of three ranges of port numbers as
candidates for selecting a local address to replace a zero port number.
The ranges are selected via a setsockopt(s, IPPROTO_IP, IP_PORTRANGE, &arg)
call. The three ranges are: default, high (to bypass firewalls) and
low (to get a port below 1024).

The default and high port ranges are sysctl settable under sysctl
net.inet.ip.portrange.*

This code also fixes a potential deadlock if the system accidently ran out
of local port addresses. It'd drop into an infinite while loop.

The secure port selection (for root) should reduce overheads and increase
reliability of rlogin/rlogind/rsh/rshd if they are modified to take
advantage of it.

Partly suggested by: pst
Reviewed by: wollman


# 994fdef9 19-Dec-1995 Garrett Wollman <wollman@FreeBSD.org>

Added a comment about why trying to make a one-behind cache for
the route in ip_output() is a bad idea.


# b7a44e34 05-Dec-1995 Garrett Wollman <wollman@FreeBSD.org>

Path MTU Discovery is now standard.


# 0312fbe9 14-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

New style sysctl & staticize alot of stuff.


# 3d1f141b 16-Oct-1995 Garrett Wollman <wollman@FreeBSD.org>

The ability to administratively change the MTU of an interface presents
a few new wrinkles for MTU discovery which tcp_output() had better
be prepared to handle. ip_output() is also modified to do something
helpful in this case, since it has already calculated the information
we need.


# b124e4f2 26-Jul-1995 Garrett Wollman <wollman@FreeBSD.org>

Fix test for determining when RSVP is inactive in a router. (In this
case, multicast options are not passed to ip_mforward().) The previous
version had a wrong test, thus causing RSVP mrouters to forward RSVP messages
in violation of the spec.


# 40a63d93 02-Jul-1995 Joerg Wunsch <joerg@FreeBSD.org>

Slightly modify my previous change to return EINVAL instead of
EFAULT.

Submitted by: Peter Wemm


# d700586c 01-Jul-1995 Joerg Wunsch <joerg@FreeBSD.org>

I saw a very low-key commit message on the netbsd mailing lists and
figured out what the problem was.. Anyway, I rate it as "highly
serious".

Submitted by: peter@haywire.DIALix.COM (Peter Wemm)


# 1c5de19a 13-Jun-1995 Garrett Wollman <wollman@FreeBSD.org>

Kernel side of 3.5 multicast routing code, based on work by Bill Fenner
and other work done here. The LKM support is probably broken, but it
still compiles and will be fixed later.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 94a5d9b6 09-May-1995 David Greenman <dg@FreeBSD.org>

Replaced some bcopy()'s with memcpy()'s so that gcc while inline/optimize.


# f5fea3dd 26-Apr-1995 Paul Traina <pst@FreeBSD.org>

Cleanup loopback interface support.
Reviewed by: wollman


# 15bd2b43 08-Apr-1995 David Greenman <dg@FreeBSD.org>

Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash,
and in_pcblookuphash.


# 20e8807c 20-Mar-1995 Garrett Wollman <wollman@FreeBSD.org>

This should be splimp() rather than splnet() since ifaddrs might go away
as a result of link-layer processing.


# 9b626c29 20-Mar-1995 Garrett Wollman <wollman@FreeBSD.org>

Fix race conditions involved in setting IP multicast options. This should
fix Dennis Fortin's problem for good, if I've got it figured out right.

(The problem was that a `struct ifaddr' could get deleted out from under
the current requester, thus leaving him with an invalid interface pointer
and causing even more bogus accesses.)


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 4dd1662b 12-Jan-1995 Ugen J.S. Antsilevich <ugen@FreeBSD.org>

Actual firewall change.
1) Firewall is not subdivided on forwarding / blocking chains
anymore.Actually only one chain left-it was the blocking one.
2) LKM support.ip_fwdef.c is function pointers definition and
goes into kernel along with all INET stuff.


# 2c17fe93 13-Dec-1994 Garrett Wollman <wollman@FreeBSD.org>

Call rtalloc_ign() so that protocol cloning will not occur at the IP layer.


# 10a642bb 12-Dec-1994 Ugen J.S. Antsilevich <ugen@FreeBSD.org>

Add match by interface from which packet arrived (via)
Handle right fragmented packets. Remove checking option
from kernel..


# 63f8d699 16-Nov-1994 Jordan K. Hubbard <jkh@FreeBSD.org>

Ugen J.S.Antsilevich's latest, happiest, IP firewall code.
Poul: Please take this into BETA. It's non-intrusive, and a rather
substantial improvement over what was there before.


# 5e9ae478 13-Sep-1994 Garrett Wollman <wollman@FreeBSD.org>

Shuffle some functions and variables around to make it possible for
multicast routing to be implemented as an LKM. (There's still a bit of
work to do in this area.)


# 01d6dc88 09-Sep-1994 Garrett Wollman <wollman@FreeBSD.org>

Disable IPMULTICAST_VIF socket option when MROUTING is not defined,
since it doesn'tmake any sense for non-routers.
CVS:


# f0068c4a 06-Sep-1994 Garrett Wollman <wollman@FreeBSD.org>

Initial get-the-easy-case-working upgrade of the multicast code
to something more recent than the ancient 1.2 release contained in
4.4. This code has the following advantages as compared to
previous versions (culled from the README file for the SunOS release):

- True multicast delivery
- Configurable rate-limiting of forwarded multicast traffic on each
physical interface or tunnel, using a token-bucket limiter.
- Simplistic classification of packets for prioritized dropping.
- Administrative scoping of multicast address ranges.
- Faster detection of hosts leaving groups.
- Support for multicast traceroute (code not yet available).
- Support for RSVP, the Resource Reservation Protocol.

What still needs to be done:

- The multicast forwarder needs testing.
- The multicast routing daemon needs to be ported.
- Network interface drivers need to have the `#ifdef MULTICAST' goop ripped
out of them.
- The IGMP code should probably be bogon-tested.

Some notes about the porting process:

In some cases, the Berkeley people decided to incorporate functionality from
later releases of the multicast code, but then had to do things differently.
As a result, if you look at Deering's patches, and then look at
our code, it is not always obvious whether the patch even applies. Let
the reader beware.

I ran ip_mroute.c through several passes of `unifdef' to get rid of
useless grot, and to permanently enable the RSVP support, which we will
include as standard.

Ported by: Garrett Wollman
Submitted by: Steve Deering and Ajit Thyagarajan (among others)


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# b5390296 31-Jul-1994 David Greenman <dg@FreeBSD.org>

fixed bug where large amounts of unidirectional UDP traffic would fill
the interface output queue and further udp packets would be fragmented
and only partially sent - keeping the output queue full and jamming the
network, but not actually getting any real work done (because you can't
send just 'part' of a udp packet - if you fragment it, you must send
the whole thing). The fix involves adding a check to make sure that the
output queue has sufficient space for all of the fragments.


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources