History log of /freebsd-current/sys/net/rtsock.c
Revision	Date	Author	Comments
# ce69e373	03-Feb-2024	Gleb Smirnoff <glebius@FreeBSD.org>	Revert "sockets: retire sorflush()" Provide a comment in sorflush() why the socket I/O sx(9) lock is actually important. This reverts commit 507f87a799cf0811ce30f0ae7f10ba19b2fd3db3.
# ab6d773d	22-Jan-2024	Gordon Bergling <gbe@FreeBSD.org>	rtsock: Fix a typo in a source code comment - s/adddress/address/ MFC after: 3 days
# 507f87a7	16-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: retire sorflush() With removal of dom_dispose method the function boils down to two meaningful function calls: socantrcvmore() and sbrelease(). The latter is only relevant for protocols that use generic socket buffers. The socket I/O sx(9) lock acquisition in sorflush() is not relevant for shutdown(2) operation as it doesn't do any I/O that may interleave with read(2) or write(2). The socket buffer mutex acquisition inside sbrelease() is what guarantees thread safety. This sx(9) acquisition in soshutdown() can be tracked down to 4.4BSD times, where it used to be sblock(), and it was carried over through the years evolving together with sockets with no reconsideration of why do we carry it over. I can't tell if that sblock() made sense back then, but it doesn't make any today. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43415
# 5bba2728	16-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: make pr_shutdown fully protocol specific method Disassemble a one-for-all soshutdown() into protocol specific methods. This creates a small amount of copy & paste, but makes code a lot more self documented, as protocol specific method would execute only the code that is relevant to that protocol and nothing else. This also fixes a couple recent regressions and reduces risk of future regressions. The extended KPI for the new pr_shutdown removes need for the extra pr_flush which was added for the sake of SCTP which could not perform its shutdown properly with the old one. Particularly for SCTP this change streamlines a lot of code. Some notes on why certain parts of code were copied or were not to certain protocols: * The (SS_ISCONNECTED | SS_ISCONNECTING | SS_ISDISCONNECTING) check is needed only for those protocols that may be connected or disconnected. * The above reduces into only SS_ISCONNECTED for those protocols that always connect instantly. * The ENOTCONN and continue processing hack is left only for datagram protocols. * The SOLISTENING(so) block is copied to those protocols that listen(2). * sorflush() on SHUT_RD is copied almost to every protocol, but that will be refactored later. * wakeup(&so->so_timeo) is copied to protocols that can make a non-instant connect(2), can SO_LINGER or can accept(2). There are three protocols (netgraph(4), Bluetooth, SDP) that did not have pr_shutdown, but old soshutdown() would still perform sorflush() on SHUT_RD for them and also wakeup(9). Those protocols partially supported shutdown(2) returning EOPNOTSUPP for SHUT_WR/SHUT_RDWR, now they fully lost shutdown(2) support. I'm pretty sure netgraph(4) and Bluetooth are okay about that and SDP is almost abandoned anyway. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43413
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# 21a722d9	25-Sep-2023	Zhenlei Huang <zlei@FreeBSD.org>	rtsock: Add sysctl flag CTLFLAG_TUN to loader tunable The sysctl variable `net.route.netisr_maxqlen` is actually a loader tunable. Add sysctl flag CTLFLAG_TUN to it so that `sysctl -T` will report it correctly. No functional change intended. Reviewed by: glebius MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D41928
# 2ff63af9	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .h pattern Remove /^\s\+\s\$FreeBSD\$.$\n/
# 2cda6a2f	26-Mar-2023	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: add public rt_is_exportable() version to check if the route can be exported to userland when jailed. Differential Revision: https://reviews.freebsd.org/D39204 MFC after: 2 weeks
# 2c2b37ad	13-Jan-2023	Justin Hibbits <jhibbits@FreeBSD.org>	ifnet/API: Move struct ifnet definition to a <net/if_private.h> Hide the ifnet structure definition, no user serviceable parts inside, it's a netstack implementation detail. Include it temporarily in <net/if_var.h> until all drivers are updated to use the accessors exclusively. Reviewed by: glebius Sponsored by: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D38046
# 42904794	15-Jan-2023	Alexander V. Chernikov <melifaro@FreeBSD.org>	rtsock: fix socket closure. Currently `close(2)` erroneously returns `EOPNOTSUPP` for `PF_ROUTE` sockets. It happened after making rtsock socket implementation self-contained ( 36b10ac2cd18 ). Rtsock code marks socket as connected in `rts_attach()`. `soclose()` tries to disconnect such socket using `.pr_disconnect` callback. Rtsock does not implement this callback, resulting in the default method being substituted. This default method returns `EOPNOTSUPP`, failing `soclose()` logic. This diff restores the previous behaviour by adding custom `pr_disconnect()` returning `ENOTCONN`. Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D38059
# 1bcd230f	03-Dec-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	netlink: add interface notification on link status / flags change. * Add link-state change notifications by subscribing to ifnet_link_event. In the Linux netlink model, link state is reported in 2 places: first is the IFLA_OPERSTATE, which stores state per RFC2863. The second is an IFF_LOWER_UP interface flag. As many applications rely on the latter, reserve 1 bit from if_flags, named as IFF_NETLINK_1. This flag is mapped to IFF_LOWER_UP in the netlink headers. This is done to avoid making applications think this flag is actually supported / presented in non-netlink outputs. * Add flag change notifications, by hooking into rt_ifmsg(). In the netlink model, notification should include the bitmask for the change flags. Update rt_ifmsg() to include such bitmask. Differential Revision: https://reviews.freebsd.org/D37597
# 7e5bf684	20-Jan-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	netlink: add netlink support Netlink is a communication protocol currently used in the Linux kernel to modify, read and subscribe to nearly all networking state. Interfaces, addresses, routes, firewall, fibs, vnets, etc are controlled via netlink. It is an async, TLV-based protocol, providing 1-1 and 1-many communications. The current implementation supports the subset of NETLINK_ROUTE family. To be more specific, the following is supported: * Dumps: - routes - nexthops / nexthop groups - interfaces - interface addresses - neighbors (arp/ndp) * Notifications: - interface arrival/departure - interface address arrival/departure - route addition/deletion * Modifications: - adding/deleting routes - adding/deleting nexthops/nexthops groups - adding/deleting neighbors - adding/deleting interfaces (basic support only) * Rtsock interaction - route events are bridged both ways The implementation also supports the NETLINK_GENERIC family framework. Implementation notes: Netlink is implemented via loadable/unloadable kernel module, not touching many kernel parts. Each netlink socket uses dedicated taskqueue to support async operations that can sleep, such as interface creation. All message processing is performed within these taskqueues. Compatibility: Most of the Netlink data models specified above map to FreeBSD concepts nicely. Unmodified ip(8) binary correctly works with interfaces, addresses, routes, nexthops and nexthop groups. Some software such as net/bird requires header-only modifications to compile and work with FreeBSD netlink. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D36002 MFC after: 2 months
# 177f04d5	29-Aug-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: constantify @rc in rib_decompose_notification(). Clarify the @rc immutability by explicitly marking @rc const. MFC after: 2 weeks
# e7d02be1	17-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protosw: refactor protosw and domain static declaration and load o Assert that every protosw has pr_attach. Now this structure is only for socket protocols declarations and nothing else. o Merge struct pr_usrreqs into struct protosw. This was suggested in 1996 by wollman@ (see 7b187005d18ef), and later reiterated in 2006 by rwatson@ (see 6fbb9cf860dcd). o Make struct domain hold a variable sized array of protosw pointers. For most protocols these pointers are initialized statically. Those domains that may have loadable protocols have spacers. IPv4 and IPv6 have 8 spacers each (andre@ dff3237ee54ea). o For inetsw and inet6sw leave a comment noting that many protosw entries very likely are dead code. o Refactor pf_proto_[un]register() into protosw_[un]register(). o Isolate pr_*_notsupp() methods into uipc_domain.c Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36232
# 036f1bc6	14-Aug-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: retire rib_lookup_info() This function was added in pre-epoch era ( 9a1b64d5a0224 ) to provide public rtentry access interface & hide rtentry internals. The implementation is based on the large on-stack copying and refcounting of the referenced objects (ifa/ifp). It has become obsolete after epoch & nexthop introduction. Convert the last remaining user and remove the function itself. Differential Revision: https://reviews.freebsd.org/D36197
# f73e4f6c	11-Aug-2022	Mateusz Guzik <mjg@FreeBSD.org>	routing: unbreak the build of a bunch of kernels Sponsored by: Rubicon Communications, LLC ("Netgate")
# d8b42ddc	11-Aug-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	rtsock: subscribe to ifnet eventhandlers instead of direct calls. Stop treating rtsock as a "special" consumer and use already-provided ifaddr arrival/departure notifications. MFC after: 2 weeks Test Plan: ``` 21:05 [0] m@devel0 route -n monitor -> ifconfig vtnet0.2 create got message of size 24 on Tue Aug 9 21:05:44 2022 RTM_IFANNOUNCE: interface arrival/departure: len 24, if# 3, what: arrival got message of size 168 on Tue Aug 9 21:05:54 2022 RTM_IFINFO: iface status change: len 168, if# 3, link: up, flags:<BROADCAST,RUNNING,SIMPLEX,MULTICAST> -> ifconfig vtnet0.2 destroy got message of size 24 on Tue Aug 9 21:05:54 2022 RTM_IFANNOUNCE: interface arrival/departure: len 24, if# 3, what: departure ``` Reviewed By: glebius Differential Revision: https://reviews.freebsd.org/D36095 MFC after: 2 weeks
# 36b10ac2	11-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	rtsock: do not use raw socket code This makes routing socket implementation self contained and removes one of the last dependencies on the raw socket code and pr_output method. There are very subtle API visible changes: - now routing socket would return EOPNOTSUPP instead of EINVAL on syscalls that are not supposed to be called on a routing socket. - routing socket buffer sizes are now controlled by net.rtsock sysctls instead of net.raw. The latter were not documented anywhere, and even Internet search doesn't find any references or discussions related to these sysctls. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36122
# d94ec749	11-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	rtsock: do not allocate mbufs_tags(9) just to store a 8-bit value Use local storage of the mbuf packet header instead. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36121
# ae6bfd12	01-Aug-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: refactor private KPI * Make nhgrp_get_nhops() return const struct weightened_nhop to indicate that the list is immutable * Make nhgrp_get_group() return the actual group, instead of group+weight. MFC after: 2 weeks
# 2717e958	27-Jul-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: move route expiration time to its nexthop Expiration time is actually a path property, not a route property. Move its storage to nexthop to simplify upcoming nhop(9) KPI changes and netlink introduction. Differential Revision: https://reviews.freebsd.org/D35970 MFC after: 2 weeks
# 33a0803f	26-Jun-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: fix debug headers added in 6fa8ed43ee0c #2. Move debug declaration out of COMPAT_FREEBSD32 in rtsock.c MFC after: 2 weeks
# 0e87bab6	25-Jun-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: fix debug headers added in 6fa8ed43ee0c. - move debug headers out of COMPAT_FREEBSD32 in rtsock.c - remove accidentally-added LOG_ defines from syslog.h MFC after: 2 weeks
# 76179e40	25-Jun-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: fix syslog include for rtsock.c MFC after: 2 weeks
# 6fa8ed43	25-Jun-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: improve debugging. Use unified guidelines for the severity across the routing subsystem. Update severity for some of the already-used messages to adhere the guidelines. Convert rtsock logging to the new FIB_ reporting format. MFC after: 2 weeks
# c260d5cd	25-Jun-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: fix crash when RTM_CHANGE results in no-op for the multipath route. Reporting logic assumed there is always some nhop change for every successful modification operation. Explicitly check that the changed nexthop indeed exists when reporting back to userland. MFC after: 2 weeks Reported by: Claudio Jeker <claudio.jeker@klarasystems.com> Tested by: Claudio Jeker <claudio.jeker@klarasystems.com>
# 9573cc35	13-May-2022	Kurosawa Takahiro <takahiro.kurosawa@gmail.com>	rtsock: fix a stack overflow struct sockaddr is not sufficient for buffer that can hold any sockaddr_* structure. struct sockaddr_storage should be used. Test: ifconfig epair create ifconfig epair0a inet6 add 2001:db8::1 up ndp -s 2001:db8::2 02:86:98:2e:96:0b proxy # this triggers kernel stack overflow Reviewed by: markj, kp Differential Revision: https://reviews.freebsd.org/D35188
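The size mismatch behind this fix can be illustrated in userland. The following is a minimal sketch, not the kernel code: `copy_sockaddr()` is a hypothetical helper introduced here for illustration, and it only shows why a `struct sockaddr_storage` destination is needed when the source may be any `sockaddr_*` variant.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not from rtsock.c): copy an arbitrary sockaddr_*
 * into caller-provided storage.  The destination is a sockaddr_storage
 * because a plain struct sockaddr is smaller than, for example,
 * struct sockaddr_in6, so using it as a buffer would overflow.
 */
static int
copy_sockaddr(struct sockaddr_storage *dst, const void *src, size_t srclen)
{
	assert(dst != NULL && src != NULL);

	/* Refuse anything that does not fit instead of writing past dst. */
	if (srclen > sizeof(*dst))
		return (-1);

	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, srclen);
	return (0);
}
```

Declaring the buffer as `struct sockaddr` and casting would compile just as well, which is why this class of bug is easy to miss in review.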
# e606e5d1	04-Apr-2022	Warner Losh <imp@FreeBSD.org>	sysctl_dumpentry: move error to inner scope Sponsored by: Netflix
# 6abb5043	27-Jan-2022	Gleb Smirnoff <glebius@FreeBSD.org>	rtsock: always set m_pkthdr.rcvif when queueing on netisr netisr uses global workstreams and after dequeueing an mbuf it uses rcvif to get the VNET of the mbuf. Of course, this is not needed when kernel is compiled without VIMAGE. It came out that routing socket does not set rcvif if compiled without VIMAGE. Make this assignment not depending on VIMAGE option. Fixes: 6871de9363e5
# 644ca084	03-Jan-2022	Gleb Smirnoff <glebius@FreeBSD.org>	domains: make domain_init() initialize only global state Now that each module handles its global and VNET initialization itself, there is no VNET related stuff left to do in domain_init(). Differential revision: https://reviews.freebsd.org/D33541
# 89128ff3	03-Jan-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protocols: init with standard SYSINIT(9) or VNET_SYSINIT The historical BSD network stack loop that rolls over domains and over protocols has no advantages over more modern SYSINIT(9). While doing the sweep, split global and per-VNET initializers. Getting rid of pr_init allows to achieve several things: o Get rid of ifdef's that protect against double foo_init() when both INET and INET6 are compiled in. o Isolate initializers statically to the module they init. o Makes code easier to understand and maintain. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D33537
# 79554f2b	14-Nov-2021	Mateusz Guzik <mjg@FreeBSD.org>	net: whack "set but not used" warnings in net/rtsock.c ... except for one where the error is ignored. Sponsored by: Rubicon Communications, LLC ("Netgate")
# 0dcef81d	23-Jul-2021	Mark Johnston <markj@FreeBSD.org>	Add required sysctl name length checks to various handlers Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation
# a3c2c06b	06-Jun-2021	Bjoern A. Zeeb <bz@FreeBSD.org>	Make LINT NOINET and NOIP kernel builds warning free. Apply #ifdef INET or #if defined(INET6) || defined(INET) to make universe NOINET and NOIP LINT kernels warning free as well again.
# 76cfc6fa	14-May-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix a use after free in update_rtm_from_rc(). update_rtm_from_rc() calls update_rtm_from_info() internally. The latter one may update provided prtm pointer with a new rtm. Reassign rtm from prtm after calling update_rtm_from_info() to avoid touching the freed rtm. PR: 255871 Submitted by: lylgood@foxmail.com MFC after: 3 days
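The bug pattern described here is general: a callee that receives a pointer-to-pointer may replace the object, so any local copy of the old pointer goes stale. Below is a minimal userland sketch of that pattern and its fix. The `struct msg` type and the `msg_*` functions are hypothetical names invented for this illustration and do not mirror the actual rtsock structures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical message type: a length followed by payload bytes. */
struct msg {
	size_t		len;	/* number of valid bytes in data[] */
	unsigned char	data[];
};

static struct msg *
msg_alloc(size_t len)
{
	struct msg *m = calloc(1, sizeof(*m) + len);

	if (m != NULL)
		m->len = len;
	return (m);
}

/* May replace *pm with a larger allocation and free the old one. */
static int
msg_grow(struct msg **pm, size_t newlen)
{
	struct msg *old = *pm, *new;

	assert(old != NULL);
	if (newlen <= old->len)
		return (0);
	new = msg_alloc(newlen);
	if (new == NULL)
		return (-1);
	memcpy(new->data, old->data, old->len);
	free(old);
	*pm = new;
	return (0);
}

static int
msg_append_byte(struct msg **pm, unsigned char b)
{
	struct msg *m = *pm;
	size_t oldlen = m->len;

	if (msg_grow(pm, oldlen + 1) != 0)
		return (-1);
	/* The fix: reload the local pointer, the old one may be freed. */
	m = *pm;
	m->data[oldlen] = b;
	return (0);
}
```

Without the reload in `msg_append_byte()`, the write through `m` would touch freed memory whenever `msg_grow()` reallocated.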
# 25682e6a	27-Apr-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix rtsock sockaddr alignment. b31fbebeb3 introduced alloc_sockaddr_aligned() which, in fact, failed to produce aligned addresses. Reported by: Oskar Holmlund <oskar.holmlund at yahoo.com> MFC after: immediately
# 5d1403a7	23-Apr-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	[rtsock] Enforce netmask/RTF_HOST consistency. Traditionally we had 2 sources of information whether the added/delete route request targets network or a host route: netmask (RTA_NETMASK) and RTF_HOST flag. The former one is tricky: netmask can be empty or can explicitly specify the host netmask. Parsing netmask sockaddr requires per-family parsing and that's what rtsock code traditionally avoided. As a result, consistency was not enforced and it was possible to specify network with the RTF_HOST flag and vice versa. Continue normalization efforts from D29826 and D29826 and ensure that RTF_HOST flag always reflects host/network data from netmask field. Differential Revision: https://reviews.freebsd.org/D29958 MFC after: 2 days
# b31fbebe	19-Apr-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Relax rtsock message restrictions. Address multiple issues with strict rtsock message validation. D28668 "normalisation" approach was based on the assumption that we always have at least "standard" sockaddr len. It turned out to be false - certain older applications like quagga or routed abuse sin[6]_len field and set it to the offset to the first fully-zero bit in the mask. It is impossible to normalise such sockaddrs without reallocation. With that in mind, change the approach to use a distinct memory buffer for the altered sockaddrs. This allows supporting the older software while maintaining the guarantee on the "standard" sockaddrs. PR: 255273,255089 Differential Revision: https://reviews.freebsd.org/D29826 MFC after: 3 days
# 758c9d54	19-Apr-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Improve error reporting in rtsock.c MFC after: 3 days
# 7f5f3fcc	10-Apr-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix direct route installation with net/bird. Slightly relax the gateway validation rules imposed by the 2fe5a79425c7, by requiring only first 8 bytes (everything before sdl_data) to be present in the AF_LINK gateway. Reported by: olivier
# e5b394f2	20-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix setting static entries for arp/ndp. rtsock message validation changes committed in 2fe5a79425c7 did not take llinfo messages into account. Add a special validation case for RTA_GATEWAY llinfo messages. MFC after: 2 days
# f9e1cd6c	19-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix arp/ndp deletion broken by 2fe5a79425c7. Changes in the 2fe5a79425c7 moved dst sockaddr masking from the routing control plane to the rtsock code. It broke arp/ndp deletion. It turns out, arp/ndp perform RTM_GET request first to get an interface index necessary for the deletion. Then they simply stamp the reply with RTF_LLDATA and set the command to RTM_DELETE. As a result, kernel receives request with non-empty RTA_NETMASK and clears RTA_DST host bits before passing the message to the lla code. De facto, the only needed bits are RTA_DST, RTA_GATEWAY and the subset of rtm_flags. With that in mind, fix the interface by clearing RTA_NETMASK for every message with RTF_LLDATA. While here, cleanup arp/ndp code a bit. MFC after: 1 day Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D28804
# a4513bac	16-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix NOINET6 build broken by 2fe5a79425c7. Reported by: mjg
# 2fe5a794	16-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix dst/netmask handling in routing socket code. Traditionally routing socket code did almost zero checks on the input message except for the most basic size checks. This resulted in the unclear KPI boundary for the routing system code (`rtrequest` and now `rib_action()`) w.r.t message validness. Multiple potential problems and nuances exist: * Host bits in RTAX_DST sockaddr. Existing applications do send prefixes with hostbits uncleared. Even `route(8)` does this, as they hope the kernel would do the job of fixing it. Code inside `rib_action()` needs to handle it on its own (see `rt_maskedcopy()` ugly hack). * There are multiple ways of adding the host route: it can be DST without netmask or DST with /32(/128) netmask. Also, RTF_HOST has to be set correspondingly. Currently, these 2 options create 2 DIFFERENT routes in the kernel. * no sockaddr length/content checking for the "secondary" fields exists: nothing stops rtsock application to send sockaddr_in with length of 25 (instead of 16). Kernel will accept it, install to RIB as is and propagate to all rtsock consumers, potentially triggering bugs in their code. Same goes for sin_port, sin_zero, etc. The goal of this change is to make rtsock verify all sockaddr and prefix consistency. Said differently, `rib_action()` or internals should NOT require to change any of the sockaddrs supplied by `rt_addrinfo` structure due to incorrectness. To be more specific, this change implements the following: * sockaddr cleanup/validation check is added immediately after getting sockaddrs from rtm. * Per-family dst/netmask checks clears host bits in dst and zeros all dst/netmask "secondary" fields. * The same netmask checking code converts /32(/128) netmasks to "host" route case (NULL netmask, RTF_HOST), removing the dualism. * Instead of allowing ANY "known" sockaddr families (0<..<AF_MAX), allow only actually supported ones (inet, inet6, link). * Automatically convert `sockaddr_sdl` (AF_LINK) gateways to `sockaddr_sdl_short`. Reported by: Guy Yur <guyyur at gmail.com> Reviewed By: donner Differential Revision: https://reviews.freebsd.org/D28668 MFC after: 3 days
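The two normalisation rules in this entry (clear host bits in dst, and fold a /32 netmask into the "host route" case) can be sketched for IPv4 as follows. This is an illustrative sketch only: `struct prefix4` and `normalize_prefix4()` are hypothetical names, addresses are plain host-order integers rather than sockaddrs, and a boolean `host` field stands in for the "NULL netmask, RTF_HOST" representation the commit describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical normalised IPv4 prefix, addresses in host byte order. */
struct prefix4 {
	uint32_t	dst;	/* destination with host bits cleared */
	uint32_t	mask;	/* netmask, all-ones for host routes */
	bool		host;	/* stands in for RTF_HOST + NULL netmask */
};

/*
 * Normalise a (dst, mask) pair:
 *  - no mask at all, or an all-ones (/32) mask, both become a host route,
 *    so the two spellings can no longer create two different routes;
 *  - otherwise the host bits of dst are cleared using the mask.
 */
static struct prefix4
normalize_prefix4(uint32_t dst, uint32_t mask, bool has_mask)
{
	struct prefix4 p;

	if (!has_mask || mask == UINT32_MAX) {
		p.dst = dst;
		p.mask = UINT32_MAX;
		p.host = true;
	} else {
		p.dst = dst & mask;
		p.mask = mask;
		p.host = false;
	}
	return (p);
}
```

For example, 192.168.1.5 with a /24 mask normalises to 192.168.1.0/24, while 192.168.1.5 with no mask and 192.168.1.5/32 normalise to the same host entry.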
# 64d5c277	14-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove now-unused RTF_RNH_LOCKED route flag. MFC after: 1 week
# 8ca99aec	12-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix various NOINET* builds broken by 145bf6c0af48. Reported by: mjg, bdragon
# 145bf6c0	08-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix blackhole/reject routes. Traditionally *BSD routing stack required to supply some interface data for blackhole/reject routes. This led to varieties of hacks in routing daemons when inserting such routes. With the recent routing stack changes, gateway sockaddr without RTF_GATEWAY started to be treated differently, purely as link identifier. This change broke net/bird, which installs blackhole routes with 127.0.0.1 gateway without RTF_GATEWAY flags. Fix this by automatically constructing necessary gateway data at rtsock level if RTF_REJECT/RTF_BLACKHOLE is set. Reported by: Marek Zarychta <zarychtam at plan-b.pwste.edu.pl> Reviewed by: donner MFC after: 1 week
# d68cf57b	07-Jan-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Refactor rt_addrmsg() and rt_routemsg(). Summary: * Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally. * Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer. * Refactor rtsock_routemsg(): avoid accessing rtentry fields directly. * Simplify in_addprefix() by moving prefix search to a separate function. Reviewers: #network Subscribers: imp, ae, bz Differential Revision: https://reviews.freebsd.org/D28011
# 2fb4a03d	24-Dec-2020	Ryan Libby <rlibby@FreeBSD.org>	rtsock: quiet -Wunused-variable in LINT-NOIP kernels Fixup after r368769 / d68fb8d978bbd3cf1b5db35f309aaeab30636d43. Reported by: mjg Reviewed by: melifaro Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D27730
# 92be2847	23-Dec-2020	Mark Johnston <markj@FreeBSD.org>	rtsock: Avoid copying uninitialized padding bytes When copying sockaddrs out to userspace, we pad them to a multiple of the platform alignment (sizeof(long)). However, some sockaddr sizes, such as struct sockaddr_dl, are not an integer multiple of the alignment, so we may end up copying out uninitialized bytes. Fix this by always bouncing through a pre-zeroed sockaddr_storage. Reported by: KASAN Reviewed by: melifaro MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27729
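The technique in this entry, padding each sockaddr to a multiple of `sizeof(long)` while bouncing it through a zeroed buffer so that the padding never carries stale bytes, can be sketched in portable C. This is a sketch under stated assumptions: `ROUNDUP_LONG` and `copy_padded()` are hypothetical names, and a 128-byte local array stands in for the pre-zeroed `sockaddr_storage` used by the real fix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Round n up to the next multiple of sizeof(long) (a power of two). */
#define	ROUNDUP_LONG(n)	(((n) + sizeof(long) - 1) & ~(sizeof(long) - 1))

/*
 * Hypothetical helper: copy salen bytes of an address to out, padded to
 * the platform alignment.  The data is first copied into a zeroed bounce
 * buffer, so the padding bytes written to out are always zero rather
 * than whatever followed the source object in memory.
 * Returns the padded length, or 0 if the address does not fit.
 */
static size_t
copy_padded(void *out, size_t outlen, const void *sa, size_t salen)
{
	unsigned char bounce[128];	/* stand-in for sockaddr_storage */
	size_t padded = ROUNDUP_LONG(salen);

	assert(out != NULL && sa != NULL && salen > 0);
	if (padded > sizeof(bounce) || padded > outlen)
		return (0);

	memset(bounce, 0, sizeof(bounce));
	memcpy(bounce, sa, salen);
	memcpy(out, bounce, padded);
	return (padded);
}
```

Copying `padded` bytes straight from the source would read past the end of any address whose size is not already a multiple of the alignment, which is the leak the commit closes.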
# d68fb8d9	18-Dec-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Switch direct rt fields access in rtsock.c to newly-created field accessors. rtsock code was built around the assumption that each rtentry record in the system radix tree is a ready-to-use sockaddr. This assumption turned out to be not quite true: * masks have their length tweaked, so we have rtsock_fix_netmask() hack * IPv6 addresses have their scope embedded, so we have another explicit deembedding hack. Change the code to decouple rtentry internals from rtsock code using newly-created rtentry accessors. This will allow to eventually eliminate both of the hacks and change rtentry dst/mask format. Differential Revision: https://reviews.freebsd.org/D27451
# d1d941c5	29-Nov-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove RADIX_MPATH config option. ROUTE_MPATH is the new config option controlling new multipath routing implementation. Remove the last pieces of RADIX_MPATH-related code and the config option. Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D27244
# 1b95005e	04-Oct-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix route flags update during RTM_CHANGE. Nexthop lookup was not considering rt_flags when doing structure comparison, which led to an original nexthop selection when changing flags. Fix the case by adding rt_flags field into comparison and rearranging nhop_priv fields to allow for efficient matching. Fix `route change X/Y flags` case - recent changes disallowed specifying RTF_GATEWAY flag without actual gateway. It turns out, route(8) fills in RTF_GATEWAY by default, unless -interface flag is specified. Fix regression by clearing RTF_GATEWAY flag instead of failing. Fix route flag reporting in RTM_CHANGE messages by explicitly updating rtm_flags after operation completion. Add IPv4/IPv6 tests for flag-only route changes.
# 9c584fa4	03-Oct-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove ROUTE_MPATH-related warnings introduced in r366390. Reported by: mjg
# fedeb08b	03-Oct-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Introduce scalable route multipath. This change is based on the nexthop objects landed in D24232. The change introduces the concept of nexthop groups. Each group contains the collection of nexthops with their relative weights and a dataplane-optimized structure to enable efficient nexthop selection. Similar to the nexthops, nexthop groups are immutable. Dataplane part gets compiled during group creation and is basically an array of nexthop pointers, compiled w.r.t their weights. With this change, `rt_nhop` field of `struct rtentry` contains either nexthop or nexthop group. They are distinguished by the presence of NHF_MULTIPATH flag. All dataplane lookup functions return pointer to the nexthop object, leaving nexthop groups details inside routing subsystem. User-visible changes: The change is intended to be backward-compatible: all non-mpath operations should work as before with ROUTE_MPATH and net.route.multipath=1. All routes now come with weight, default weight is 1, maximum is 2^24-1. Current maximum multipath group width is statically set to 64. This will become sysctl-tunable in the followup changes. Using functionality: * Recompile kernel with ROUTE_MPATH * set net.route.multipath to 1 route add -6 2001:db8::/32 2001:db8::2 -weight 10 route add -6 2001:db8::/32 2001:db8::3 -weight 20 netstat -6On Nexthop groups data Internet6: GrpIdx NhIdx Weight Slots Gateway Netif Refcnt 1 ------- ------- ------- --------------------------------------- --------- 1 13 10 1 2001:db8::2 vlan2 14 20 2 2001:db8::3 vlan2 Next steps: * Land outbound hashing for locally-originated routes ( D26523 ). * Fix net/bird multipath (net/frr seems to work fine) * Add ROUTE_MPATH to GENERIC * Set net.route.multipath=1 by default Tested by: olivier Reviewed by: glebius Relnotes: yes Differential Revision: https://reviews.freebsd.org/D26449
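The "array of nexthop pointers, compiled w.r.t. their weights" idea can be sketched as below. This is an assumption-laden illustration, not the kernel algorithm: it reduces weights by their greatest common divisor so that weights 10 and 20 yield 1 and 2 slots (matching the `netstat` sample in the entry above), and all names (`struct nhop`, `struct nhgrp`, `nhgrp_compile()`, `nhgrp_select()`, `MAX_SLOTS`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	MAX_SLOTS	64	/* hypothetical cap on the slot array */

struct nhop {
	int		idx;	/* nexthop index, for display only */
	uint32_t	weight;	/* relative weight, must be non-zero */
};

struct nhgrp {
	const struct nhop *slots[MAX_SLOTS];
	size_t		nslots;
};

static uint32_t
gcd_u32(uint32_t a, uint32_t b)
{
	while (b != 0) {
		uint32_t t = a % b;
		a = b;
		b = t;
	}
	return (a);
}

/* Fill g->slots so each nexthop appears weight/gcd times. */
static int
nhgrp_compile(struct nhgrp *g, const struct nhop *nh, size_t n)
{
	uint32_t div = 0, k;
	size_t i;

	assert(g != NULL && nh != NULL && n > 0);
	for (i = 0; i < n; i++) {
		if (nh[i].weight == 0)
			return (-1);
		div = gcd_u32(div, nh[i].weight);
	}
	g->nslots = 0;
	for (i = 0; i < n; i++) {
		uint32_t cnt = nh[i].weight / div;

		if (cnt > MAX_SLOTS - g->nslots)
			return (-1);	/* group too wide for this sketch */
		for (k = 0; k < cnt; k++)
			g->slots[g->nslots++] = &nh[i];
	}
	return (0);
}

/* Dataplane side: one modulo and one array load per lookup. */
static const struct nhop *
nhgrp_select(const struct nhgrp *g, uint32_t flowid)
{
	return (g->slots[flowid % g->nslots]);
}
```

Because the slot array is built once at group creation and never modified, lookups need no locking in this model, which is in the spirit of the immutability the commit message emphasises.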
# 2259a030	21-Sep-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Rework part of routing code to reduce difference to D26449. * Split rt_setmetrics into get_info_weight() and rt_set_expire_info(), as these two can be applied at different entities and at different times. * Start filling route weight in route change notifications * Pass flowid to UDP/raw IP route lookups * Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact that rtentry can contain multiple nexthops. Differential Revision: https://reviews.freebsd.org/D26497
# 4fa9815a	05-Sep-2020	Ed Maste <emaste@FreeBSD.org>	rtsock.c: remove extraneous space Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D26249
# 662c1305	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	net: clean up empty lines in .c and .h files
# a624ca3d	28-Aug-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Move net/route/shared.h definitions to net/route/route_var.h. No functional changes. net/route/shared.h was created in the initial phases of nexthop conversion. It was intended to serve the same purpose as route_var.h - share definitions of functions and structures between the routing subsystem components. At that time route_var.h was included by many files external to the routing subsystem, which largely defeated its purpose. As currently this is not the case anymore and amount of route_var.h includes is roughly the same as shared.h, retire the latter in favour of the former.
# 592d300e	24-Aug-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove RT_LOCK mutex from rte. rtentry lock traditionally served 2 purposes: first was protecting refcounts, the second was assuring consistent field access/changes. Since route nexthop introduction, the need for the former disappeared and the need for the latter reduced. To be more precise, the following rte fields are mutable: rt_nhop (nexthop pointer, updated with RIB_WLOCK, passed in rib_cmd_info) rte_flags (only RTF_HOST and RTF_UP, where RTF_UP gets changed at rte removal) rt_weight (relative weight, updated with RIB_WLOCK, passed in rib_cmd_info) rt_expire (time when rte deletion is scheduled, updated with RIB_WLOCK) rt_chain (deletion chain pointer, updated with RIB_WLOCK) All of them are updated under RIB_WLOCK, so the only remaining concern is the reading. rt_nhop and rt_weight (addressed in this review) are read under rib lock and stored in the rib_cmd_info, so the caller has no problem with consistency. rte_flags is currently read unlocked in rtsock reporting (however the scope is only RTF_UP flag, which is pretty static). rt_expire is currently read unlocked in rtsock reporting. rt_chain accesses are safe, as this is only used at route deletion. rt_expire and rte_flags reads will be dealt with in separate reviews soon. Differential Revision: https://reviews.freebsd.org/D26162
# 93bfd365	22-Aug-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Rename rt_flags to rte_flags && reduce number of rt_nhop accesses. No functional changes. Most of the routing flags are stored in the nexthop instead of rtentry. Rename rt->rt_flags to rt->rte_flags to simplify reading/modifying code checking routing flags. In the new multipath code, rt->rt_nhop may actually point to nexthop group instead of nhop. To ease transition, reduce the amount of rt->rt_nhop->... accesses. Differential Revision: https://reviews.freebsd.org/D26156
# bec053ff	15-Aug-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Make net.inet6.ip6.deembed_scopeid behaviour default & remove sysctl. Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D25637
# a287a973	10-Jun-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Switch rtsock code to using newly-created rib_action() KPI call. This simplifies the code and allows to further split rtentry and nexthop, removing one of the blockers for multipath code introduction, described in D24141. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D25192
# 2bbab0af	23-May-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Use epoch(9) for rtentries to simplify control plane operations. Currently the only reason of refcounting rtentries is the need to report the rtable operation details immediately after the execution. Delaying rtentry reclamation allows to stop refcounting and simplify the code. Additionally, this change allows to reimplement rib_lookup_info(), which is used by some of the customers to get the matching prefix along with nexthops, in more efficient way. The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to nhop_priv to be able to reliably set curvnet even during vnet teardown. Rest of the reference counting code will be removed in the D24867 . Differential Revision: https://reviews.freebsd.org/D24866
# 656442a7	08-May-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Embed dst sockaddr into rtentry and remove rte packet counter Currently each rtentry has dst&gateway allocated separately from another zone, bloating cache accesses. Current 'struct rtentry' has 12 "mandatory" radix pointers in the beginning, leaving 4 usable pointers/32 bytes in the first 2 cache lines (amd64). Fields needed for the datapath are destination sockaddr and rt_nhop. So far it doesn't look like there is other routable addressing protocol other than IPv4/IPv6/MPLS, which uses keys longer than 20 bytes. With that in mind, embed dst into struct rtentry, making the first 24 bytes of rtentry within 128 bytes. That is enough to make IPv6 address within first 128 bytes. It is still pretty easy to add code for supporting separately-allocated dst, however it doesn't make a lot of sense in having such code without a use case. As rS359823 moved the gateway to the nexthop structure, the dst embedding change removes the need for any additional allocations done by rt_setgate(). Lastly, as a part of cleanup, remove counter(9) allocation code, as this field is not used in packet processing anymore. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24669
# 8c61eb21	29-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Convert more rtentry field accesses into nhop fields accesses. Continue routing subsystem conversion to nhop objects defined in r359823. Use fields from nhop structure instead of "struct rtentry" fields. This is one of the last changes prior to removing rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry. Differential Revision: https://reviews.freebsd.org/D24609
# fc88ecd3	28-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Move route-specific ddb commands to route/route_ddb.c Currently functionality resides in rtsock.c, which is a controlling interface, partially external to the routing subsystem. Additionally, DDB-supporting functionality is > 100SLOC, which deserves a separate file. Given that, move this functionality to a newly-created net/route/ subdir. Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D24561
# e7d8af4f	28-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Move route_temporal.c and route_var.h to net/route. Nexthop objects implementation, defined in r359823, introduced sys/net/route directory intended to hold all routing-related code. Move recently-introduced route_temporal.c and private route_var.h header there. Differential Revision: https://reviews.freebsd.org/D24597
# aaad3c4f	23-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Convert rtentry field accesses into nhop field accesses. One of the goals of the new routing KPI defined in r359823 is to entirely hide `struct rtentry` from the consumers. It will allow to improve routing subsystem internals and deliver more features much faster. This commit is mostly mechanical change to eliminate direct struct rtentry field accesses. The only notable difference is AF_LINK gateway encoding. AF_LINK gw is used in routing stack for operations with interface routes and host loopback routes. In the former case it indicates _some_ non-NULL gateway, as the interface is the same as in rt_ifp in kernel and rtm_ifindex in rtsock reporting. In the latter case the interface index inside gateway was used by the IPv6 datapath to verify address scope for link-local interfaces. Kernel uses struct sockaddr_dl for this type of gateway. This structure allows for specifying rich interface data, such as mac address and interface name. However, this results in relatively large structure size - 52 bytes. Routing stack fills in only 2 fields - sdl_index and sdl_type, which reside in the first 8 bytes of the structure. In the new KPI, struct nhop_object tries to be cache-efficient, hence embodies gateway address inside the structure. In the AF_LINK case it stores shortened version of the structure - struct sockaddr_dl_short, which occupies 16 bytes. After D24340 changes, the data inside AF_LINK gateway will not be used in the kernel at all, leaving rtsock as the only potential concern. 
The difference in rtsock reporting: (old) got message of size 240 on Thu Apr 16 03:12:13 2020 RTM_ADD: Add Route: len 240, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 link#5 255.255.255.0 (new) got message of size 200 on Sun Apr 19 09:46:32 2020 RTM_ADD: Add Route: len 200, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 link#5 255.255.255.0 Note 40 bytes different (52-16 + alignment). However, gateway is still a valid AF_LINK gateway with proper data filled in. It is worth noting that these particular messages (interface routes) are mostly ignored by routing daemons: * bird/quagga/frr uses RTM_NEWADDR and ignores prefix route addition messages. * quagga/frr ignores routes without gateway More detailed overview on how rtsock messages are used by the routing daemons to reconstruct the kernel view, can be found in D22974. Differential Revision: https://reviews.freebsd.org/D24519
# 2538a4b1	16-Apr-2020	Adrian Chadd <adrian@FreeBSD.org>	Remove a duplicate definition of nhops_dump_sysctl() One of the source files included both nhop.h and shared.h, leading to this clashing. Tested with: mips-gcc cross toolchain
# a6663252	12-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Introduce nexthop objects and new routing KPI. This is the foundational change for the routing subsystem rearchitecture. More details and goals are available in https://reviews.freebsd.org/D24141 . This patch introduces concept of nexthop objects and new nexthop-based routing KPI. Nexthops are objects, containing all necessary information for performing the packet output decision. Output interface, mtu, flags, gw address goes there. For most of the cases, these objects will serve the same role as the struct rtentry is currently serving. Typically there will be low tens of such objects for the router even with multiple BGP full-views, as these objects will be shared between routing entries. This allows to store more information in the nexthop. New KPI: struct nhop_object fib4_lookup(uint32_t fibnum, struct in_addr dst, uint32_t scopeid, uint32_t flags, uint32_t flowid); struct nhop_object fib6_lookup(uint32_t fibnum, const struct in6_addr dst6, uint32_t scopeid, uint32_t flags, uint32_t flowid); These 2 functions are intended to replace all flavours of <in_\|in6_>rtalloc[1]<_ign><_fib>, mpath functions and the previous fib[46]-generation functions. Upon successful lookup, they return nexthop object which is guaranteed to exist within current NET_EPOCH. If longer lifetime is desired, one can specify NHR_REF as a flag and get a referenced version of the nexthop. Reference semantic closely resembles rtentry one, allowing sed-style conversion. Additionally, another 2 functions are introduced to support uRPF functionality inside variety of our firewalls. 
Their primary goal is to hide the multipath implementation details inside the routing subsystem, greatly simplifying firewalls implementation: int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid, uint32_t flags, const struct ifnet src_if); int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr dst6, uint32_t scopeid, uint32_t flags, const struct ifnet src_if); All functions have a separate scopeid argument, paving way to eliminating IPv6 scope embedding and allowing to support IPv4 link-locals in the future. Structure changes: * rtentry gets new 'rt_nhop' pointer, slightly growing the overall size. * rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz. Old KPI: During the transition state old and new KPI will coexist. As there are another 4-5 decent-sized conversion patches, it will probably take a couple of weeks. To support both KPIs, fields not required by the new KPI (most of rtentry) have to be kept, resulting in the temporary size increase. Once conversion is finished, rtentry will notably shrink. More details: * architectural overview: https://reviews.freebsd.org/D24141 * list of the next changes: https://reviews.freebsd.org/D24232 Reviewed by: ae,glebius(initial version) Differential Revision: https://reviews.freebsd.org/D24232
# 7029da5c	26-Feb-2020	Pawel Biernacki <kaktus@FreeBSD.org>	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
# e02d3fe7	07-Jan-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix rtsock route message generation for interface addresses. Reviewed by: olivier MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D22974
# d9302031	31-Dec-2019	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix NOINET6 build broken by r356236. MFC after: 2 weeks
# c83dda36	31-Dec-2019	Alexander V. Chernikov <melifaro@FreeBSD.org>	Split gigantic rtsock route_output() into smaller functions. Amount of changes to the original code has been intentionally minimised to ease diffing. The changes are mostly mechanical, with the following exceptions: * lltable handler is now called directly based on RTF_LLINFO flag presence. * "report" logic for updating rtm in RTM_GET/RTM_DELETE has been simplified, fixing several potential use-after-free cases in rt_addrinfo. * lltable asserts have been replaced with error-returning, preventing kernel crashes when lltable gw af family is invalid (root required). MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22864
# 8a9a28c4	07-Nov-2019	Gleb Smirnoff <glebius@FreeBSD.org>	sysctl_rtsock() has all necessary locking and doesn't need Giant to run. While here add description.
# b8a6e03f	07-Oct-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Widen NET_EPOCH coverage. When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas. However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win. Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier. On output path we already enter epoch quite early - in the ip_output(), in the ip6_output(). This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch. Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed. This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources. Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111
# 4835262b	10-Sep-2019	Li-Wen Hsu <lwhsu@FreeBSD.org>	Fix build for the platforms where db_expr_t is not long Sponsored by: The FreeBSD Foundation
# 0b12ab81	09-Sep-2019	Conrad Meyer <cem@FreeBSD.org>	Appease Clang false-positive Werrors in r352112 Reported by: bcran
# 8b6acd2b	09-Sep-2019	Conrad Meyer <cem@FreeBSD.org>	ddb(4): Add 'show route <dest>' and 'show routetable [<af>]' These commands show the route resolved for a specified destination, or print out the entire routing table for a given address family (or all families, if none is explicitly provided). Discussed with: emaste Differential Revision: https://reviews.freebsd.org/D21510
# e6481fd4	14-Apr-2019	Michael Tuexen <tuexen@FreeBSD.org>	When sending a routing message, don't allow the user to set the RTF_RNH_LOCKED flag in rtm_flags, since this flag is used only internally. Reported by: syzbot+65c676f5248a13753ea0@syzkaller.appspotmail.com Reviewed by: ae@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D19898
# cc31f982	10-Jan-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Remove recursive NET_EPOCH_ENTER() from sysctl_ifmalist(), missed in r342872.
# a68cc388	08-Jan-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Mechanical cleanup of epoch(9) usage in network stack. - Remove macros that covertly create epoch_tracker on thread stack. Such macros are quite unsafe, e.g. will produce a buggy code if same macro is used in embedded scopes. Explicitly declare epoch_tracker always. - Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read locking macros to what they actually are - the net_epoch. Keeping them as is is very misleading. They all are named FOO_RLOCK(), while they no longer have lock semantics. Now they allow recursion and what's more important they now no longer guarantee protection against their companion WLOCK macros. Note: INP_HASH_RLOCK() has same problems, but not touched by this commit. This is non functional mechanical change. The only functionally changed functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter epoch recursively. Discussed with: jtl, gallatin
# a716ad4a	27-Nov-2018	Andrey V. Elsukov <ae@FreeBSD.org>	Fix possible panic during ifnet detach in rtsock. The panic can happen, when some application does dump of routing table using sysctl interface. To prevent this, set IFF_DYING flag in if_detach_internal() function, when ifnet under lock is removed from the chain. In sysctl_rtsock() take IFNET_RLOCK_NOSLEEP() to prevent ifnet detach during routes enumeration. In case, if some interface was detached in the time before we take the lock, add the check, that ifnet is not DYING. This prevents access to memory that could be freed after ifnet is unlinked. PR: 227720, 230498, 233306 Reviewed by: bz, eugen MFC after: 1 week Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D18338
# d25f8522	26-Nov-2018	Mark Johnston <markj@FreeBSD.org>	Plug routing sysctl leaks. Various structures exported by sysctl_rtsock() contain padding fields which were not being zeroed. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: ae MFC after: 3 days Security: kernel memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18333
# 5f901c92	24-Jul-2018	Andrew Turner <andrew@FreeBSD.org>	Use the new VNET_DEFINE_STATIC macro when we are defining static VNET variables. Reviewed by: bz Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16147
# 6573d758	03-Jul-2018	Matt Macy <mmacy@FreeBSD.org>	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066
# 20efcfc6	16-Jun-2018	Andrey V. Elsukov <ae@FreeBSD.org>	Switch RIB and RADIX_NODE_HEAD lock from rwlock(9) to rmlock(9). Using rwlock with multiqueue NICs for IP forwarding on high pps produces high lock contention and is inefficient. Rmlock fits better for such workloads. Reviewed by: melifaro, olivier Obtained from: Yandex LLC Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D15789
# 4f6c66cc	23-May-2018	Matt Macy <mmacy@FreeBSD.org>	UDP: further performance improvements on tx Cumulative throughput while running 64 netperf -H $DUT -t UDP_STREAM -- -m 1 on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps Single stream throughput increases from 910kpps to 1.18Mpps Baseline: https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg - Protect read access to global ifnet list with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg - Protect short lived ifaddr references with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg - Convert if_afdata read lock path to epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg A fix for the inpcbhash contention is pending sufficient time on a canary at LLNW. Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15409
# d7c5a620	18-May-2018	Matt Macy <mmacy@FreeBSD.org>	ifnet: Replace if_addr_lock rwlock with epoch + mutex Run on LLNW canaries and tested by pho@ gallatin: Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5 based ConnectX 4-LX NIC, I see an almost 12% improvement in received packet rate, and a larger improvement in bytes delivered all the way to userspace. When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1, I see, using nstat -I mce0 1 before the patch: InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32 4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32 4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32 4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32 4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32 4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32 4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32 After the patch InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51 5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51 5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51 5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51 5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52 5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52 Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15366
# 7edd877a	06-May-2018	Matt Macy <mmacy@FreeBSD.org>	The ifnet pointer (ifp) in rt_newaddrmsg can be valid without ifp->if_addr being set if the ifnet is still live by way of a reference but in line for deletion. Check ifp->if_addr before dereferencing. Approved by: sbruno
# 6469bdcd	06-Apr-2018	Brooks Davis <brooks@FreeBSD.org>	Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compatibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941
# b3b6ff23	21-Feb-2018	Ryan Stone <rstone@FreeBSD.org>	Allow route change requests to not specify the gateway. Only require a gateway to be specified on a route add request. On a route change request that does not specify the gateway, the gateway will remain the same. This allows changing other route parameters without having to re-specify the gateway, like in "route change 10.0.0.0/8 -mtu 9000". Update the route(8) manpage to explicitly call out this usage as being supported. MFC after: 2 weeks Sponsored by: Dell EMC Isilon Reviewed By: eugen (rtsock.c change), rgrimes Differential Revision: https://reviews.freebsd.org/D14291
# 279e33d4	22-Jan-2018	Konstantin Belousov <kib@FreeBSD.org>	Fix compat32 for sysctl net.PF_ROUTE...NET_RT_IFLISTL. Route messages are aligned to the host long type alignment, which breaks 32bit. Reported and tested by: lwhsu Diagnosed by: Yuri Pankov <yuripv@icloud.com> Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# 68ce5a03	07-Apr-2017	Patrick Kelsey <pkelsey@FreeBSD.org>	Fix typo in comment. logest -> longest MFC after: 3 days
# fbbd9655	28-Feb-2017	Warner Losh <imp@FreeBSD.org>	Renumber copyright clause 4 Renumber clause 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistence on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96
# 55dfce58	15-Nov-2016	Mark Johnston <markj@FreeBSD.org>	Plug a lock leak in sysctl_ifmalist(). Fix style in the local variable declarations. PR: 214542 MFC after: 1 week
# ab607f28	12-Nov-2016	Ryan Stone <rstone@FreeBSD.org>	Don't read if_counters with if_addr_lock held Calling into an ifnet implementation with the if_addr_lock already held can cause a LOR and potentially a deadlock, as ifnet implementations typically can take the if_addr_lock after their own locks during configuration. Refactor a sysctl handler that was violating this to read if_counter data in a temporary buffer before the if_addr_lock is taken, and then copying the data in its final location later, when the if_addr_lock is held. PR: 194109 Reported by: Jean-Sebastien Pedron MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D8498 Reviewed by: sbruno
# 484149de	03-Jun-2016	Bjoern A. Zeeb <bz@FreeBSD.org>	Introduce a per-VNET flag to enable/disable netisr processing on that VNET. Add accessor functions to toggle the state per VNET. The base system (vnet0) will always enable itself with the normal registration. We will share the registered protocol handlers in all VNETs minimising duplication and management. Upon disabling netisr processing for a VNET drain the netisr queue from packets for that VNET. Update netisr consumers to (de)register on a per-VNET start/teardown using VNET_SYS(UN)INIT functionality. The change should be transparent for non-VIMAGE kernels. Reviewed by: gnn (, hiren) Obtained from: projects/vnet MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D6691
# a4641f4e	03-May-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	sys/net*: minor spelling fixes. No functional change.
# 02abd400	19-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	kernel: use our nitems() macro when it is available through param.h. No functional change, only trivial cases are done in this sweep, Discussed in: freebsd-current
# 35030a5d	29-Mar-2016	Edward Tomasz Napierala <trasz@FreeBSD.org>	Remove some NULL checks for M_WAITOK allocations. MFC after: 1 month Sponsored by: The FreeBSD Foundation
# 61eee0e2	24-Jan-2016	Alexander V. Chernikov <melifaro@FreeBSD.org>	MFP r287070,r287073: split radix implementation and route table structure. There are a number of radix consumers in kernel land (pf,ipfw,nfs,route) with different requirements. In fact, first 3 don't have _any_ requirements and first 2 do not use radix locking. On the other hand, routing structures do have these requirements (rnh_gen, multipath, custom to-be-added control plane functions, different locking). Additionally, radix should not know anything about its consumers' internals. So, radix code now uses tiny 'struct radix_head' structure along with internal 'struct radix_mask_head' instead of 'struct radix_node_head'. Existing consumers still use the same 'struct radix_node_head' with slight modifications: they need to pass pointer to (embedded) 'struct radix_head' to all radix callbacks. Routing code now uses new 'struct rib_head' with different locking macro: RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing information base). New net/route_var.h header was added to hold routing subsystem internal data. 'struct rib_head' was placed there. 'struct rtentry' will also be moved there soon.
# 9a1b64d5	04-Jan-2016	Alexander V. Chernikov <melifaro@FreeBSD.org>	Add rib_lookup_info() to provide API for retrieving individual route entries data in unified format. There are control plane functions that require information other than just next-hop data (e.g. individual rtentry fields like flags or prefix/mask). Given that the goal is to avoid rte reference/refcounting, re-use rt_addrinfo structure to store most rte fields. If caller wants to retrieve key/mask or gateway (which are sockaddrs and are allocated separately), it needs to provide sufficient-sized sockaddrs structures w/ their pointers saved in passed rt_addrinfo. Convert: * lltable new records checks (in_lltable_rtcheck(), nd6_is_new_addr_neighbor()). * rtsock pre-add/change route check. * IPv6 NS ND-proxy check (RADIX_MPATH code was eliminated because 1) we don't support RTF_ANNOUNCE ND-proxy for networks and there should not be multiple host routes for such hosts 2) if we have multiple routes we should inspect them (which is not done). 3) the entire idea of abusing KRT as storage for ND proxy seems odd. Userland programs should be used for that purpose).
# f7bab8d0	09-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Switch route radix to dual-lock model: use rmlock for data path access, and config rwlock for control plane processing. Route table changes require both locks held.
# 69d149ad	09-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Since we no longer return individual radix entries, it is not possible to do per-rte accounting. Remove rt_kpktsent.
# 033074c4	09-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Replace 'struct route ' if_output() argument with 'struct nhop_info '. Leave 'struct route' as is for legacy routing api users. Remove most of rtalloc_ign*-derived functions.
# 55e5eda6	08-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Separate radix and routing: use different structures for route and for other customers. Introduce new 'struct rib_head' for routing purposes and make all routing api use it.
# 8c3cfe0b	04-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Hide 'struct rtentry' and all its macros inside new header: net/route_internal.h The goal is to make it opaque for all code except route/rtsock and proto domain _rmx.
# 56b61ca2	19-Sep-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Remove ifq_drops from struct ifqueue. Now queue drops are accounted in struct ifnet if_oqdrops. Some netgraph modules used ifqueue w/o ifnet. Accounting of queue drops is simply removed from them. There were no API to read this statistic. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 4f8585e0	11-Sep-2014	Alan Somers <asomers@FreeBSD.org>	Revisions 264905 and 266860 added a "int fib" argument to ifa_ifwithnet and ifa_ifwithdstaddr. For the sake of backwards compatibility, the new arguments were added to new functions named ifa_ifwithnet_fib and ifa_ifwithdstaddr_fib, while the old functions became wrappers around the new ones that passed RT_ALL_FIBS for the fib argument. However, the backwards compatibility is not desired for FreeBSD 11, because there are numerous other incompatible changes to the ifnet(9) API. We therefore decided to remove it from head but leave it in place for stable/9 and stable/10. In addition, this commit adds the fib argument to ifa_ifwithbroadaddr for consistency's sake. sys/sys/param.h Increment __FreeBSD_version sys/net/if.c sys/net/if_var.h sys/net/route.c Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute. sys/net/route.c sys/net/rtsock.c sys/netinet/in_pcb.c sys/netinet/ip_options.c sys/netinet/ip_output.c sys/netinet6/nd6.c Fixup calls of modified functions. share/man/man9/ifnet.9 Document changed API. CR: https://reviews.freebsd.org/D458 MFC after: Never Sponsored by: Spectra Logic
# e6485f73	31-Aug-2014	Gleb Smirnoff <glebius@FreeBSD.org>	o Remove struct if_data from struct ifnet. Now it is merely API structure for route(4) socket and ifmib(4) sysctl. o Move fields from if_data to ifnet, but keep all statistic counters separate, since they should disappear later. o Provide function if_data_copy() to fill if_data, utilize it in routing socket and ifmib handler. o Provide overridable ifnet(9) method to fetch counters. If no provided, if_get_counters_compat() would be used, that returns old counters. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 73d76e77	14-Aug-2014	Kevin Lo <kevlo@FreeBSD.org>	Change pr_output's prototype to avoid the need for explicit casts. This is a follow up to r269699. Phabric: D564 Reviewed by: jhb
# 9753faf5	29-Jul-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Garbage collect couple of unused fields from struct ifaddr: - ifa_claim_addr() unused since removal of NetAtalk - ifa_metric seems to be never utilized, always a copy of if_metric
# 2f308a34	29-May-2014	Alan Somers <asomers@FreeBSD.org>	Fix unintended KBI change from r264905. Add _fib versions of ifa_ifwithnet() and ifa_ifwithdstaddr() The legacy functions will call the _fib() versions with RT_ALL_FIBS, preserving legacy behavior. sys/net/if_var.h sys/net/if.c Add legacy-compatible functions as described above. Ensure legacy behavior when RT_ALL_FIBS is passed as fibnum. sys/netinet/in_pcb.c sys/netinet/ip_output.c sys/netinet/ip_options.c sys/net/route.c sys/net/rtsock.c sys/netinet6/nd6.c Call with _fib() functions if we must use a specific fib, or the legacy functions otherwise. tests/sys/netinet/fibs_test.sh tests/sys/netinet/udp_dontroute.c Improve the udp_dontroute test. The bug that this test exercises is that ifa_ifwithnet() will return the wrong address, if multiple interfaces have addresses on the same subnet but with different fibs. The previous version of the test only considered one possible failure mode: that ifa_ifwithnet_fib() might fail to find any suitable address at all. The new version also checks whether ifa_ifwithnet_fib() finds the correct address by checking where the ARP request goes. Reported by: bz, hrs Reviewed by: hrs MFC after: 1 week X-MFC-with: 264905 Sponsored by: Spectra Logic
# 6db47af4	08-May-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Rename rt_msg1() to more handy rtsock_msg_mbuf(). (Just for history purposes: rt_msg2() was renamed to rtsock_msg_buffer() in r265019). Sponsored by: Yandex LLC MFC after: 1 month
# 3deb3649	08-May-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix incorrect netmasks being passed via rtsock. Since radix has been ignoring sa_family in passed sockaddrs, no one ever has bothered filling valid sa_family in netmasks. Additionally, radix adjusts sa_len field in every netmask not to compare zero bytes at all. This leads us to rt_mask with sa_family of AF_UNSPEC (-1) and arbitrary sa_len field (0 for default route, for example). However, rtsock has been passing that rt_mask intact for ages, requiring all rtsock consumers to make their own local hacks. We even have an unfixed one in base: do `route -n monitor` in one window and issue `route -n get addr` for some directly-connected address. You will probably see the following: got message of size 304 on Thu May 8 15:06:06 2014 RTM_GET: Report Metrics: len 304, pid: 30493, seq 1, errno 0, flags:<UP,DONE,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK,IFP,IFA> 10.0.0.0 link#1 (255) ffff ffff ff em0:8.0.27.c5.29.d4 10.0.0.92 _________________^^^^^^^^^^^^^^^^^^ after the change: got message of size 312 on Thu May 8 15:44:07 2014 RTM_GET: Report Metrics: len 312, pid: 2895, seq 1, errno 0, flags:<UP,DONE,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK,IFP,IFA> 10.0.0.0 link#1 255.255.255.0 em0:8.0.27.c5.29.d4 10.0.0.92 _________________^^^^^^^^^^^^^^^^^^ Sponsored by: Yandex LLC MFC after: 1 month
# c9f98940	03-May-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix sysctl_ifmalist() broken in r265019. Reported by: Olivier Cochard-Labbé MFC with: r265019
# d9437c0f	29-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Partially revert r265019 - allocating 512 bytes on stack can be too much for architectures like ARM. Always use rounded malloc instead. Discussed with: jmallett MFC after: 4 weeks
# 0fb9298d	29-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Move rt_setmetrics() from rtsock.c to route.c. All rtsock-initiated rte creation/modification are now performed in route.c holding radix tree write lock. This reduces the need for per-rte mutex. Sponsored by: Yandex LLC MFC after: 1 month
# de46b2c6	27-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix build Found by: ian Pointyhat to: me
# f2e5eb36	27-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Improve memory allocation model for rt_msg2() rtsock messages: * memory is now allocated as early as possible, without holding locks. * sysctl users are now guaranteed to get a response (M_WAITOK buffer prealloc). * socket users are more likely to use on-stack buffer for replies. * standard kernel malloc/free functions are now used instead of radix wrappers. rt_msg2() has been renamed to rtsock_msg_buffer(). MFC after: 1 month
# f1fcb552	27-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove useless zeroing of RTAX_DST on error. Cleanup a bit. MFC after: 1 month
# 92c227af	27-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Cleanup route_output() a bit. MFC after: 1 month
# 2277c5e5	27-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Do not delay freeing rtm. Bandaid added in r227061 is not needed since r227061, MFC after: 1 month
# f5d9a696	26-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Move up fibnum to ensure it is always defined. Found by: ian MFC with: r264987
# 773aa053	26-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Determine fibnum once in the beginning of route_output(). MFC after: 1 month
# c77462dd	26-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Decouple RTM_CHANGE from RTM_GET handling in rtsock.c:route_output(). RTM_CHANGE is now handled inside route.c:rtrequest1_fib() as it should be. Note that the change handler is a separate function rtrequest1_fib_change(). MFC after: 1 month
# 36d55f0f	26-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Unify sa_equal() macro usage. MFC after: 2 weeks
# 0cfee0c2	24-Apr-2014	Alan Somers <asomers@FreeBSD.org>	Fix subnet and default routes on different FIBs on the same subnet. These two bugs are closely related. The root cause is that ifa_ifwithnet does not consider FIBs when searching for an interface address. sys/net/if_var.h sys/net/if.c Add a fib argument to ifa_ifwithnet and ifa_ifwithdstaddr. Those functions will only return an address whose interface fib equals the argument. sys/net/route.c Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib arguments. sys/netinet/in.c Update in_addprefix to consider the interface fib when adding prefixes. This will prevent it from not adding a subnet route when one already exists on a different fib. sys/net/rtsock.c sys/netinet/in_pcb.c sys/netinet/ip_output.c sys/netinet/ip_options.c sys/netinet6/nd6.c Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet. In some cases there wasn't a clear specific fib number to use. In others, I was unable to test those functions so I chose RT_DEFAULT_FIB to minimize divergence from current behavior. I will fix some of the latter changes along with PR kern/187553. tests/sys/netinet/fibs_test.sh tests/sys/netinet/udp_dontroute.c tests/sys/netinet/Makefile Revert r263738. The udp_dontroute test was right all along. However, bugs kern/187550 and kern/187553 cancelled each other out when it came to this test. Because of kern/187553, ifa_ifwithnet searched the default fib instead of the requested one, but because of kern/187550, there was an applicable subnet route on the default fib. The new test added in r263738 doesn't work right, however. I can verify with dtrace that ifa_ifwithnet returned the wrong address before I applied this commit, but route(8) miraculously found the correct interface to use anyway. I don't know how. Clear expected failure messages for kern/187550 and kern/187552. PR: kern/187550 PR: kern/187552 Reviewed by: melifaro MFC after: 3 weeks Sponsored by: Spectra Logic
# 2d70c0de	19-Mar-2014	Gleb Smirnoff <glebius@FreeBSD.org>	When exporting ifnet via sysctl, add ifqueue(9) drop count to the ifi_oqdrops. This is a temporary workaround until ifqueue(9) vanishes. While here, remove the pointless ifi_vhid assignment. It has sense only when we are exporting ifaddrs, not ifnets. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 2c284d93	13-Mar-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Remove IPX support. IPX was a network transport protocol in Novell's NetWare network operating system from late 80s and then 90s. The NetWare itself switched to TCP/IP as default transport in 1998. Later, in this century the Novell Open Enterprise Server became successor of Novell NetWare. The last release that claimed to still support IPX was OES 2 in 2007. Routing equipment vendors (e.g. Cisco) discontinued support for IPX in 2011. Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.
# b245f96c	12-Mar-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Since 32-bit if_baudrate isn't enough to describe a baud rate of a 10 Gbit interface, in the r241616 a crutch was provided. It didn't work well, and finally we decided that it is time to break ABI and simply make if_baudrate a 64-bit value. Meanwhile, the entire struct if_data was reviewed. o Remove the if_baudrate_pf crutch. o Make all fields of struct if_data fixed machine independent size. The notion of data (packet counters, etc) are by no means MD. And it is a bug that on amd64 we've got a 64-bit counters, while on i386 32-bit, which at modern speeds overflow within a second. This also removes quite a lot of COMPAT_FREEBSD32 code. o Give 16 bit for the ifi_datalen field. This field was provided to make future changes to if_data less ABI breaking. Unfortunately the 8 bit size of it had effectively limited sizeof if_data to 256 bytes. o Give 32 bits to ifi_mtu and ifi_metric. o Give 64 bits to the rest of fields, since they are counters. __FreeBSD_version bumped. Discussed with: emax Sponsored by: Netflix Sponsored by: Nginx, Inc.
# e3a7aa6f	04-Mar-2014	Gleb Smirnoff <glebius@FreeBSD.org>	- Remove rt_metrics_lite and simply put its members into rtentry. - Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This removes another cache trashing ++ from packet forwarding path. - Create zini/fini methods for the rtentry UMA zone. Via initialize mutex and counter in them. - Fix reporting of rmx_pksent to routing socket. - Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode. The change is mostly targeted for stable/10 merge. For head, rt_pksent is expected to just disappear. Discussed with: melifaro Sponsored by: Netflix Sponsored by: Nginx, Inc.
# c5d4eab6	19-Feb-2014	Marko Zec <zec@FreeBSD.org>	V_irtualize rtsock refcounting, which reduces the chances for panics on teardown of vnets without active routing sockets while at least one routing socket is active elsewhere. Tested by: Vijay Singh MFC after: 3 days
# d375edc9	09-Jan-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Simplify inet alias handling code: if we're adding/removing alias which has the same prefix as some other alias on the same interface, use newly-added rt_addrmsg() instead of hand-rolled in_addralias_rtmsg(). This eliminates the following rtsock messages: Pinned RTM_ADD for prefix (for alias addition). Pinned RTM_DELETE for prefix (for alias withdrawal). Example (got 10.0.0.1/24 on vlan4, playing with 10.0.0.2/24): before commit, addition: got message of size 116 on Fri Jan 10 14:13:15 2014 RTM_NEWADDR: address being added to iface: len 116, metric 0, flags: sockaddrs: <NETMASK,IFP,IFA,BRD> 255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255 got message of size 192 on Fri Jan 10 14:13:15 2014 RTM_ADD: Add Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 10.0.0.2 (255) ffff ffff ff after commit, addition: got message of size 116 on Fri Jan 10 13:56:26 2014 RTM_NEWADDR: address being added to iface: len 116, metric 0, flags: sockaddrs: <NETMASK,IFP,IFA,BRD> 255.255.255.0 vlan4:8.0.27.c5.29.d4 14.0.0.2 14.0.0.255 before commit, withdrawal: got message of size 192 on Fri Jan 10 13:58:59 2014 RTM_DELETE: Delete Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 10.0.0.2 (255) ffff ffff ff got message of size 116 on Fri Jan 10 13:58:59 2014 RTM_DELADDR: address being removed from iface: len 116, metric 0, flags: sockaddrs: <NETMASK,IFP,IFA,BRD> 255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255 after commit, withdrawal: got message of size 116 on Fri Jan 10 14:14:11 2014 RTM_DELADDR: address being removed from iface: len 116, metric 0, flags: sockaddrs: <NETMASK,IFP,IFA,BRD> 255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255 Sending both RTM_ADD/RTM_DELETE messages to rtsock is completely wrong (and requires some hacks to keep prefix in route table on RTM_DELETE). 
I've tested this change with quagga (no change) and bird (). bird alias handling is already broken in BSD sysdep code, so nothing changes here, too. I'm going to MFC this change if there are no complaints about the behavior change. While here, fix some style(9) bugs introduced by r260488 (pointed out by glebius and bde). Sponsored by: Yandex LLC MFC after: 4 weeks
# 4cbac30b	09-Jan-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Split rt_newaddrmsg_fib() into two different functions. Adding/deleting interface addresses involves access to 3 different subsystems, in different parts of code. Each call can fail, so reporting successful operation by rtsock in the middle of the process is error-prone. Further split routing notification API and actual rtsock calls via creating public-available rt_addrmsg() / rt_routemsg() functions with "private" rtsock_* backend. MFC after: 2 weeks
# 7d9b6df1	08-Jan-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Consistently use RT_ALL_FIBS everywhere instead of -1. MFC after: 2 weeks
# 5a2f4cbd	04-Jan-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Change semantics for rnh_lookup() function: now it performs exact match search, regardless of netmask existence. This simplifies most of rnh_lookup() consumers. Fix panic triggered by deleting non-existent host route. PR: kern/185092 Submitted by: Nikolay Denev <ndenev at gmail.com> MFC after: 1 month
# 76039bc8	26-Oct-2013	Gleb Smirnoff <glebius@FreeBSD.org>	The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare to this event, adding if_var.h to files that do need it. Also, include all includes that now are included due to implicit pollution via if_var.h Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 7caf4ab7	15-Oct-2013	Gleb Smirnoff <glebius@FreeBSD.org>	- Utilize counter(9) to accumulate statistics on interface addresses. Add four counters to struct ifaddr. This kills '+=' on variables shared between processors for every packet. - Nuke struct if_data from struct ifaddr. - In ip_input() do not put a reference on ifaddr, instead update statistics right now in place and do IN_IFADDR_RUNLOCK(). This removes atomic(9) for every packet. [1] - To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in rtsock.c fill if_data fields using counter_u64_fetch(). - Accidentally fix bug in COMPAT_32 version of NET_RT_IFLISTL, which took if_data not from the ifaddr, but from ifaddr's ifnet. [2] Submitted by: melifaro [1], pluknet[2] Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 413e45bf	20-Aug-2013	Bjoern A. Zeeb <bz@FreeBSD.org>	After r241616 properly export ifi_baudrate_pf in the 32bit compat case. MFC after: 3 days
# 12bdf23a	28-Jul-2013	Hiroki Sato <hrs@FreeBSD.org>	sin6 should be assigned before the loop.
# 4825b1e0	11-Jul-2013	Hiroki Sato <hrs@FreeBSD.org>	Add a leaf node CTL_NET.PF_ROUTE.0.AF.NET_RT_DUMP.0.FIB. This returns routing table with the specified FIB number, not td->td_proc->p_fibnum.
# f672f56f	24-Jun-2013	Qing Li <qingli@FreeBSD.org>	Due to the routing related networking kernel redesign work in FBSD 8.0, interface routes have been returned to the applications without the RTF_GATEWAY bit. This incompatibility has caused some issues with Zebra, Quagga and the like. This patch provides the RTF_GATEWAY flag bit in returned interface routes so as to behave similarly to pre 8.0 systems. Reviewed by: hrs Verified by: mackn at opendns dot com
# c69f77c3	14-Mar-2013	Gleb Smirnoff <glebius@FreeBSD.org>	- Use m_getcl() instead of hand allocating. - Convert panic() to KASSERT. - Remove superfluous cleaning of mbuf fields after allocation. - Add comment on possible use of m_get2() here. Sponsored by: Nginx, Inc.
# 0bebb544	05-Dec-2012	Hiroki Sato <hrs@FreeBSD.org>	- Move definition of V_deembed_scopeid to scope6_var.h. - Deembed scope id in L3 address in in6_lltable_dump(). - Simplify scope id recovery in rtsock routines. - Remove embedded scope id handling in ndp(8) and route(8) completely.
# eb1b1807	05-Dec-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Mechanically substitute flags from historic mbuf allocator with malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually
# 5c9fa630	04-Dec-2012	Hiroki Sato <hrs@FreeBSD.org>	- Fix LOR in sa6_recoverscope() in rt_msg2()[1]. - Check V_deembed_scopeid before checking if sa_family == AF_INET6. - Fix scope id handling in route(8)[2] and ifconfig(8). Reported by: rpaulo[1], Mateusz Guzik[1], peter[2]
# d60ec817	17-Nov-2012	Adrian Chadd <adrian@FreeBSD.org>	Fix up a compile time warning if INET6 isn't defined.
# 6bbfef90	17-Nov-2012	Hiroki Sato <hrs@FreeBSD.org>	Fill sin6_scope_id in sockaddr_in6 before passing it from the kernel to userland via routing socket or sysctl. This eliminates the following KAME-specific sin6_scope_id handling routine from each userland utility: sin6.sin6_scope_id = ntohs(*(u_int16_t *)&sin6.sin6_addr.s6_addr[2]); This behavior can be controlled by net.inet6.ip6.deembed_scopeid. This is set to 1 by default (sin6_scope_id will be filled in the kernel). Reviewed by: bz
# c9b652e3	18-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Mechanically remove the last stray remains of spl* calls from net/. They have been Noop's for a long time now.
# c2508034	22-Apr-2012	Alexander V. Chernikov <melifaro@FreeBSD.org>	Do not require radix write lock to be held while dumping route table via sysctl(4) interface. This permits router not to stop forwarding packets while route table is being written to user-supplied buffer. Reported by: Pawel Tyll <ptyll@nitronet.pl> Approved by: kib(mentor) MFC after: 1 week
# 6d076ae8	10-Feb-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	Introduce a new NET_RT_IFLISTL API to query the address list. It works on extended and extensible structs if_msghdrl and ifa_msghdrl. This will allow us to extend both the msghdrl structs and eventually if_data in the future without breaking the ABI. Bump __FreeBSD_version to allow ports to more easily detect the new API. Reviewed by: glebius, brooks MFC after: 3 days
# e82cf13b	10-Feb-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	Backout changes from r228571. Remove if_data from struct ifa_msghdr again. While this breaks carp on HEAD temporary, it restores the upgrade path from stable, and head before 20111215. Reviewed by: glebius, brooks
# c82c34b4	08-Jan-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Copy ifa->if_data to ifam->ifam_data. This was forgotten in r228571. Submitted by: bz
# 137f91e8	05-Jan-2012	John Baldwin <jhb@FreeBSD.org>	Convert all users of IF_ADDR_LOCK to use new locking macros that specify either a read lock or write lock. Reviewed by: bz MFC after: 2 weeks
# 08b68b0e	15-Dec-2011	Gleb Smirnoff <glebius@FreeBSD.org>	A major overhaul of the CARP implementation. The ip_carp.c was started from scratch, copying needed functionality from the old implementation on demand, with a thorough review of all code. The main change is that interface layer has been removed from the CARP. Now redundant addresses are configured exactly on the interfaces, they run on. The CARP configuration itself is, as before, configured and read via SIOCSVH/SIOCGVH ioctls. A new prefix created with SIOCAIFADDR or SIOCAIFADDR_IN6 may now be configured to a particular virtual host id, which makes the prefix redundant. ifconfig(8) semantics has been changed too: now one doesn't need to clone carpXX interface, he/she should directly configure a vhid on an Ethernet interface. To supply vhid data from the kernel to an application the getifaddrs(3) function had been changed to pass ifam_data with each address. [1] The new implementation definitely closes all PRs related to carp(4) being an interface, and may close several others. It also allows to run a single redundant IP per interface. Big thanks to Bjoern Zeeb for his help with inet6 part of patch, for idea on using ifam_data and for several rounds of reviewing! PR: kern/117000, kern/126945, kern/126714, kern/120130, kern/117448 Reviewed by: bz Submitted by: bz [1]
# 6472ac3d	07-Nov-2011	Ed Schouten <ed@FreeBSD.org>	Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
# 3ca1a2d6	03-Nov-2011	Max Laier <mlaier@FreeBSD.org>	Fix a use-after-free/redzone issue in the routing code. Reported by (repeatedly): Mike Tancsa Prodded by (repeatedly): bz Forgotten by (repeatedly): mlaier MFC after: 2 weeks
# 528737fd	28-Sep-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Pass the fibnum where we need filtering of the message on the rtsock allowing routing daemons to filter routing updates on an rtsock per FIB. Adjust raw_input() and split it into wrapper and a new function taking an optional callback argument even though we only have one consumer [1] to keep the hackish flags local to rtsock.c. PR: kern/134931 Submitted by: multiple (see PR) Suggested by: rwatson [1] Reviewed by: rwatson MFC after: 3 days
# 826bf287	09-Feb-2011	Max Laier <mlaier@FreeBSD.org>	As info.rti_info[RTAX_DST] can point inside of rtm we must not free the rtm until rt_dispatch is done with the sockaddr. Found by: memguard MFC after: 3 days
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# ee7c7fee	16-Oct-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	Close a race acquiring the IF_ADDR_LOCK() for each entry while iterating over all interfaces to make sure the address will neither change nor be freed while we are working on it. PR: kern/146250 Submitted by: Mikolaj Golub (to.my.trociny gmail.com) MFC after: 1 week
# 04f32057	03-Aug-2010	Konstantin Belousov <kib@FreeBSD.org>	Properly set ifi_datalen for compat32 struct if_data32. PR: kern/149240 Submitted by: Stef Walter <stef memberwebs com> MFC after: 1 week
# dd62f5c0	25-Jun-2010	Qing Li <qingli@FreeBSD.org>	MFC r208553 This patch fixes the problem where proxy ARP entries cannot be added over the if_ng interface. Approved by: re (bz)
# 0ed6142b	25-May-2010	Qing Li <qingli@FreeBSD.org>	This patch fixes the problem where proxy ARP entries cannot be added over the if_ng interface. MFC after: 3 days
# b0af8356	08-May-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r207194: Provide 32bit compat shims for sysctl net.route NET_RT_IFLIST.
# 427a928a	25-Apr-2010	Konstantin Belousov <kib@FreeBSD.org>	Provide 32bit compat shims for sysctl net.route NET_RT_IFLIST. This allows getifaddrs(3) to work for compat32 binaries. Submitted by: jhb (6.x version) Reviewed by: emaste Tested by: emaste and <pluknet gmail com> MFC after: 2 weeks
# 32c53401	05-Jan-2010	Qing Li <qingli@FreeBSD.org>	MFC r201282, r201543 r201282 ------- The proxy arp entries could not be added into the system over the IFF_POINTOPOINT link types. The reason was due to the routing entry returned from the kernel covering the remote end is of an interface type that does not support ARP. This patch fixes this problem by providing a hint to the kernel routing code, which indicates the prefix route instead of the PPP host route should be returned to the caller. Since a host route to the local end point is also added into the routing table, and there could be multiple such instantiations due to multiple PPP links can be created with the same local end IP address, this patch also fixes the loopback route installation failure problem observed prior to this patch. The reference count of loopback route to local end would be either incremented or decremented. The first instantiation would create the entry and the last removal would delete the route entry. r201543 ------- The IFA_RTSELF address flag marks a loopback route has been installed for the interface address. This marker is necessary to properly support PPP types of links where multiple links can have the same local end IP address. The IFA_RTSELF flag bit maps to the RTF_HOST value, which was combined into the route flag bits during prefix installation in IPv6. This inclusion causing the prefix route to be unusable. This patch fixes this bug by excluding the IFA_RTSELF flag during route installation. PR: ports/141342, kern/141134
# c7ab6602	30-Dec-2009	Qing Li <qingli@FreeBSD.org>	The proxy arp entries could not be added into the system over the IFF_POINTOPOINT link types. The reason was due to the routing entry returned from the kernel covering the remote end is of an interface type that does not support ARP. This patch fixes this problem by providing a hint to the kernel routing code, which indicates the prefix route instead of the PPP host route should be returned to the caller. Since a host route to the local end point is also added into the routing table, and there could be multiple such instantiations due to multiple PPP links can be created with the same local end IP address, this patch also fixes the loopback route installation failure problem observed prior to this patch. The reference count of loopback route to local end would be either incremented or decremented. The first instantiation would create the entry and the last removal would delete the route entry. MFC after: 5 days
# 950cde50	28-Dec-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	MFC r200473: Throughout the network stack we have a few places of if (jailed(cred)) left. If you are running with a vnet (virtual network stack) those will return true and defer you to classic IP-jails handling and thus things will be "denied" or returned with an error. Work around this problem by introducing another "jailed()" function, jailed_without_vnet(), that also takes vnets into account, and permits the calls, should the jail from the given cred have its own virtual network stack. We cannot change the classic jailed() call to do that, as it is used outside the network stack as well. Discussed with: julian, zec, jamie, rwatson (back in Sept)
# de0bd6f7	13-Dec-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	Throughout the network stack we have a few places of if (jailed(cred)) left. If you are running with a vnet (virtual network stack) those will return true and defer you to classic IP-jails handling and thus things will be "denied" or returned with an error. Work around this problem by introducing another "jailed()" function, jailed_without_vnet(), that also takes vnets into account, and permits the calls, should the jail from the given cred have its own virtual network stack. We cannot change the classic jailed() call to do that, as it is used outside the network stack as well. Discussed with: julian, zec, jamie, rwatson (back in Sept) MFC after: 5 days
# c7276c59	30-Aug-2009	Qing Li <qingli@FreeBSD.org>	As part of r196609, a call to "rtalloc" did not take the fib into account. So call the appropriate "rtalloc_ign_fib()" instead of calling "rtalloc_ign()". Reviewed by: pointed out by bz Approved by: re
# 5311e988	30-Aug-2009	Qing Li <qingli@FreeBSD.org>	As part of r196609, a call to "rtalloc" did not take the fib into account. So call the appropriate "rtalloc_ign_fib()" instead of calling "rtalloc_ign()". Reviewed by: pointed out by bz MFC after: immediately
# ba3ae75b	30-Aug-2009	Qing Li <qingli@FreeBSD.org>	MFC r196609 In ip_output(), the flow-table module must not try to cache L2/L3 information for interface of IFF_POINTOPOINT or IFF_LOOPBACK type. Since the L2 information (rt_lle) is invalid for these interface types, accidental caching attempt will trigger panic when the invalid rt_lle reference is accessed. When installing a new route, or when updating an existing route, the user supplied gateway address may be an interface address (this is particularly true for point-to-point interface related modules such as ppp, if_tun, if_gif). Currently the routing command handler always set the RTF_GATEWAY flag if the gateway address is given as part of the command parameters. Therefore the gateway address must be verified against interface addresses or else the route would be treated as an indirect route, thus making that route unusable. Reviewed by: kmacy, julian, rwatson Approved by: re
# 9231d35f	28-Aug-2009	Qing Li <qingli@FreeBSD.org>	In ip_output(), the flow-table module must not try to cache L2/L3 information for interface of IFF_POINTOPOINT or IFF_LOOPBACK type. Since the L2 information (rt_lle) is invalid for these interface types, accidental caching attempt will trigger panic when the invalid rt_lle reference is accessed. When installing a new route, or when updating an existing route, the user supplied gateway address may be an interface address (this is particularly true for point-to-point interface related modules such as ppp, if_tun, if_gif). Currently the routing command handler always set the RTF_GATEWAY flag if the gateway address is given as part of the command parameters. Therefore the gateway address must be verified against interface addresses or else the route would be treated as an indirect route, thus making that route unusable. Reviewed by: kmacy, julian, rwatson Verified by: marcus MFC after: 3 days
# 9b36fbe9	13-Aug-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	MFC r196174: Put multiple instructions into a block when iterating; unbreaks NET_RT_DUMP, which otherwise only returned information of AF_MAX. This was broken in r193232 (save your time - my bug, my fix). Reported by: Larry Baird (lab gta.com) Tested by: Larry Baird (lab gta.com) Reviewed by: zec, lstewart, qing PR: kern/137700 Approved by: re (kib)
# 20b0cdb7	13-Aug-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	Put multiple instructions into a block when iterating; unbreaks NET_RT_DUMP, which otherwise only returned information of AF_MAX. This was broken in r193232 (save your time - my bug, my fix). PR: kern/137700 Reported by: Larry Baird (lab gta.com) Tested by: Larry Baird (lab gta.com) Reviewed by: zec, lstewart, qing Approved by: re (kib)
# 530c0060	01-Aug-2009	Robert Watson <rwatson@FreeBSD.org>	Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)
# d0728d71	23-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Introduce and use a sysinit-based initialization scheme for virtual network stacks, VNET_SYSINIT: - Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will occur each time a network stack is instantiated and destroyed. In the !VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT. For the VIMAGE case, we instead use SYSINIT's to track their order and properties on registration, using them for each vnet when created/ destroyed, or immediately on module load for already-started vnets. - Remove vnet_modinfo mechanism that existed to serve this purpose previously, as well as its dependency scheme: we now just use the SYSINIT ordering scheme. - Implement VNET_DOMAIN_SET() to allow protocol domains to declare that they want init functions to be called for each virtual network stack rather than just once at boot, compiling down to DOMAIN_SET() in the non-VIMAGE case. - Walk all virtualized kernel subsystems and make use of these instead of modinfo or DOMAIN_SET() for init/uninit events. In some cases, convert modular components from using modevent to using sysinit (where appropriate). In some cases, do minor rejuggling of SYSINIT ordering to make room for or better manage events. Portions submitted by: jhb (VNET_SYSINIT), bz (cleanup) Discussed with: jhb, bz, julian, zec Reviewed by: bz Approved by: re (VIMAGE blanket)
# eddfbb76	14-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Build on Jeff Roberson's linker-set based dynamic per-CPU allocator (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables. Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing them in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of the kernel linker. Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided. This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS. Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)
# 8c0fec80	23-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Modify most routines returning 'struct ifaddr *' to return references rather than pointers, requiring callers to properly dispose of those references. The following routines now return references: ifaddr_byindex ifa_ifwithaddr ifa_ifwithbroadaddr ifa_ifwithdstaddr ifa_ifwithnet ifaof_ifpforaddr ifa_ifwithroute ifa_ifwithroute_fib rt_getifa rt_getifa_fib IFP_TO_IA ip_rtaddr in6_ifawithifp in6ifa_ifpforlinklocal in6ifa_ifpwithaddr in6_ifadd carp_iamatch6 ip6_getdstifaddr Remove unused macro which didn't have required referencing: IFP_TO_IA6 This closes many small races in which changes to interface or address lists while an ifaddr was in use could lead to use of freed memory (etc). In a few cases, add missing if_addr_list locking required to safely acquire references. Because of a lack of deep copying support, we accept a race in which an in6_ifaddr pointed to by mbuf tags and extracted with ip6_getdstifaddr() doesn't hold a reference while in transmit. Once we have mbuf tag deep copy support, this can be fixed. Reviewed by: bz Obtained from: Apple, Inc. (portions) MFC after: 6 weeks (portions)
# 1099f828	21-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Clean up common ifaddr management: - Unify reference count and lock initialization in a single function, ifa_init(). - Move tear-down from a macro (IFAFREE) to a function ifa_free(). - Move reference count bump from a macro (IFAREF) to a function ifa_ref(). - Instead of using a u_int protected by a mutex, use refcount(9) for reference count management. The ifa_mtx is now used for exactly one ioctl, and possibly should be removed. MFC after: 3 weeks
# c03528b6	10-Jun-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	SCTP needs either IPv4 or IPv6 as lower layer[1]. So properly hide the already #ifdef SCTP code with #if defined(INET) || defined(INET6) as well to get us closer to a non-INET/INET6 kernel. Discussed with: tuexen [1]
# 8d8bc018	08-Jun-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	After r193232 rt_tables in vnet.h are no longer indirectly dependent on the ROUTETABLES kernel option thus there is no need to include opt_route.h anymore in all consumers of vnet.h and no longer depend on it for module builds. Remove the hidden include in flowtable.h as well and leave the two explicit #includes in ip_input.c and ip_output.c.
# c2c2a7c1	01-Jun-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	Convert the two dimensional array to be malloced and introduce an accessor function to get the correct rnh pointer back. Update netstat to get the correct pointer using kvm_read() as well. This not only fixes the ABI problem depending on the kernel option but also permits the tunable to overwrite the kernel option at boot time up to MAXFIBS, enlarging the number of FIBs without having to recompile. So people could just use GENERIC now. Reviewed by: julian, rwatson, zec X-MFC: not possible
# d4b5cae4	01-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Reimplement the netisr framework in order to support parallel netisr threads: - Support up to one netisr thread per CPU, each processing its own workstream, or set of per-protocol queues. Threads may be bound to specific CPUs, or allowed to migrate, based on a global policy. In the future it would be desirable to support topology-centric policies, such as "one netisr per package". - Allow each protocol to advertise an ordering policy, which can currently be one of: NETISR_POLICY_SOURCE: packets must maintain ordering with respect to an implicit or explicit source (such as an interface or socket). NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work, as well as allowing protocols to provide a flow generation function for mbufs without flow identifiers (m2flow). Falls back on NETISR_POLICY_SOURCE if no flow ID is available. NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for each packet handled by netisr (m2cpuid). - Provide utility functions for querying the number of workstreams being used, as well as a mapping function from workstream to CPU ID, which protocols may use in work placement decisions. - Add explicit interfaces to get and set per-protocol queue limits, and get and clear drop counters, which query data or apply changes across all workstreams. - Add a more extensible netisr registration interface, in which protocols declare 'struct netisr_handler' structures for each registered NETISR_ type. These include name, handler function, optional mbuf to flow ID function, optional mbuf to CPU ID function, queue limit, and ordering policy. Padding is present to allow these to be expanded in the future. If no queue limit is declared, then a default is used. - Queue limits are now per-workstream, and raised from the previous IFQ_MAXLEN default of 50 to 256.
- All protocols are updated to use the new registration interface, and with the exception of netnatm, default queue limits. Most protocols register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use NETISR_POLICY_FLOW, and will therefore take advantage of driver- generated flow IDs if present. - Formalize a non-packet based interface between interface polling and the netisr, rather than having polling pretend to be two protocols. Provide two explicit hooks in the netisr worker for start and end events for runs: netisr_poll() and netisr_pollmore(), as well as a function, netisr_sched_poll(), to allow the polling code to schedule netisr execution. DEVICE_POLLING still embeds single-netisr assumptions in its implementation, so for now if it is compiled into the kernel, a single and un-bound netisr thread is enforced regardless of tunable configuration. In the default configuration, the new netisr implementation maintains the same basic assumptions as the previous implementation: a single, un-bound worker thread processes all deferred work, and direct dispatch is enabled by default wherever possible. Performance measurement shows a marginal performance improvement over the old implementation due to the use of batched dequeue. An rmlock is used to synchronize use and registration/unregistration using the framework; currently, synchronized use is disabled (replicating current netisr policy) due to a measurable 3%-6% hit in ping-pong micro-benchmarking. It will be enabled once further rmlock optimization has taken place. However, in practice, netisrs are rarely registered or unregistered at runtime. A new man page for netisr will follow, but since one doesn't currently exist, it hasn't been updated. This change is not appropriate for MFC, although the polling shutdown handler should be merged to 7-STABLE. Bump __FreeBSD_version. Reviewed by: bz
# 0304c731	27-May-2009	Jamie Gritton <jamie@FreeBSD.org>	Add hierarchical jails. A jail may further virtualize its environment by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings. Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge(). Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call. Approved by: bz (mentor)
# 21ca7b57	05-May-2009	Marko Zec <zec@FreeBSD.org>	Change the curvnet variable from a global const struct vnet *, previously always pointing to the default vnet context, to a dynamically changing thread-local one. The curvnet context should be set on entry to networking code via CURVNET_SET() macros, and reverted to previous state via CURVNET_RESTORE(). Recursions on curvnet are permitted, though strongly discouraged. This change should have no functional impact on nooptions VIMAGE kernel builds, where CURVNET_ macros expand to whitespace. The curthread->td_vnet (aka curvnet) variable's purpose is to be an indicator of the vnet context in which the current network-related operation takes place, in case we cannot deduce the current vnet context from any other source, such as by looking at mbuf's m->m_pkthdr.rcvif->if_vnet, socket's so->so_vnet etc. Moreover, so far curvnet has turned out to be an invaluable consistency checking aid: it helps to catch cases when sockets, ifnets or any other vnet-aware structures may have leaked from one vnet to another. The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros was a result of an empirical iterative process, with an aim to reduce recursions on CURVNET_SET() to a minimum, while still reducing the scope of CURVNET_SET() to networking only operations - the alternative would be calling CURVNET_SET() on each system call entry. In general, curvnet has to be set in three typical cases: when processing socket-related requests from userspace or from within the kernel; when processing inbound traffic flowing from device drivers to upper layers of the networking stack, and when executing timer-driven networking functions. This change also introduces a DDB subcommand to show the list of all vnet instances. Approved by: julian (mentor)
# 093f25f8	26-Apr-2009	Marko Zec <zec@FreeBSD.org>	In preparation for turning on options VIMAGE in next commits, rearrange / replace / adjust several INIT_VNET_* initializer macros, all of which currently resolve to whitespace. Reviewed by: bz (an older version of the patch) Approved by: julian (mentor)
# 77ee4c0e	20-Apr-2009	Robert Watson <rwatson@FreeBSD.org>	Acquire address list lock before walking an interface's address list to identify possible jail addresses on it for IPv4 and IPv6. MFC after: 2 weeks
# 427ac07f	14-Apr-2009	Kip Macy <kmacy@FreeBSD.org>	Extend route command: - add show as alias for get - add weights to allow mpath to do more than equal cost - add sticky / nostick to disable / re-enable per-connection load balancing This adds a field to rt_metrics_lite so network bits of world will need to be re-built. Reviewed by: jeli & qingli
# 9c79d243	05-Feb-2009	Jamie Gritton <jamie@FreeBSD.org>	Call prison_if from rtm_get_jailed, instead of splitting it out into prison_check_ip4 and prison_check_ip6. As prison_if includes a jailed() check, remove that check before calling rtm_get_jailed. Approved by: bz (mentor)
# b89e82dd	05-Feb-2009	Jamie Gritton <jamie@FreeBSD.org>	Standardize the various prison_foo_ip[46] functions and prison_if to return zero on success and an error code otherwise. The possible errors are EADDRNOTAVAIL if an address being checked for doesn't match the prison, and EAFNOSUPPORT if the prison doesn't have any addresses in that address family. For most callers of these functions, use the returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or EINVAL. Always include a jailed() check in these functions, where a non-jailed cred always returns success (and makes no changes). Remove the explicit jailed() checks that preceded many of the function calls. Approved by: bz (mentor)
# 1cecba0f	25-Jan-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	For consistency with prison_{local,remote,check}_ipN rename prison_getipN to prison_get_ipN. Submitted by: jamie (as part of a larger patch) MFC after: 1 week
# 82b334e8	16-Jan-2009	Qing Li <qingli@FreeBSD.org>	The RTF_LLINFO was revived unconditionally, but within the kernel the check on the sysctl argument value being RTF_LLINFO is conditioned on the COMPAT_ROUTE_FLAGS kernel option. This mismatch caused the L2 table retrieval failure, and the arp/ndp -an command displays empty L2 tables. Reviewed by: pjd
# 14981d80	12-Jan-2009	Qing Li <qingli@FreeBSD.org>	Revive the RTF_LLINFO flag in route.h. The kernel code is guarded by the new kernel option COMPAT_ROUTE_FLAGS for binary backward compatibility. The RTF_LLDATA flag maps to the same value as RTF_LLINFO. RTF_LLDATA is used by the arp and ndp utilities. The RTF_LLDATA flag is always returned to the userland regardless whether the COMPAT_ROUTE_FLAGS is defined.
# c2ded8ae	09-Jan-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	Rather than using the cred from curthread, take it from the thread referenced in the sysctl req argument. Reviewed by: rwatson MFC after: 2 weeks
# 813dd6ae	09-Jan-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	Restrict arp, ndp and theoretically the FIB listing (if not read with libkvm) to the addresses of a prison, when inside a jail. [1] As the patch from the PR was pre-'new-arp', add checks to the llt_dump handlers as well. While touching RTM_GET in route_output(), consistently use curthread credentials rather than the creds from the socket there. [2] PR: kern/68189 Submitted by: Mark Delany <sxcg2-fuwxj@qmda.emu.st> [1] Discussed with: rwatson [2] Reviewed by: rwatson MFC after: 4 weeks
# ebda3fc3	09-Jan-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	Take the cred from curthread rather than curproc as curproc would need locking but the credential from curthread (usually) never changes. Discussed with: jhb MFC after: 2 weeks
# 8eca593c	26-Dec-2008	Qing Li <qingli@FreeBSD.org>	This checkin addresses a couple of issues: 1. The "route" command allows route insertion through the interface-direct option "-iface". During if_attach(), a sockaddr_dl{} entry is created for the interface and is part of the interface address list. This sockaddr_dl{} entry describes the interface in detail. The "route" command selects this entry as the "gateway" object when the "-iface" option is present. The "arp" and "ndp" commands also interact with the kernel through the routing socket when adding and removing static L2 entries. The static L2 information is also provided through the "gateway" object with an AF_LINK family type, similar to what is provided by the "route" command. In order to differentiate between these two types of operations, a RTF_LLDATA flag is introduced. This flag is set by the "arp" and "ndp" commands when issuing the add and delete commands. This flag is also set in each L2 entry returned by the kernel. The "arp" and "ndp" commands follow a convention where a RTM_GET is issued first followed by a RTM_ADD/DELETE. This RTM_GET request fills in the fields for a "rtm" object, which is reinjected into the kernel by a subsequent RTM_ADD/DELETE command. The entry returned from RTM_GET is a prefix route, so the RTF_LLDATA flag must be specified when issuing the RTM_ADD/DELETE messages. 2. Enforce the convention that NET_RT_FLAGS with a 0 w_arg is the specification for retrieving L2 information. Also optimized the code logic. Reviewed by: julian
# 6e6b3f7c	14-Dec-2008	Qing Li <qingli@FreeBSD.org>	The main goals of this project are: 1. separating L2 tables (ARP, NDP) from the L3 routing tables 2. removing as many locking dependencies among these layers as possible to allow for some parallelism in the search operations 3. simplifying the logic in the routing code. The most notable end result is the obsolescence of the route cloning (RTF_CLONING) concept, which translated into code reduction in both IPv4 ARP and IPv6 NDP related modules, and size reduction in struct rtentry{}. The change in design obsoletes the semantics of RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland applications such as "arp" and "ndp" have been modified to reflect those changes. The output from "netstat -r" shows only the routing entries. Quite a few developers have contributed to this project in the past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and Andre Oppermann. And most recently: - Kip Macy revised the locking code completely, thus completing the last piece of the puzzle, Kip has also been conducting active functional testing - Sam Leffler has helped me improve/refactor the code, and provided valuable reviews - Julian Elischer set up the perforce tree for me and has helped me maintain that branch before the svn conversion
# a5a3926c	13-Dec-2008	Andrew Thompson <thompsa@FreeBSD.org>	Don't leak the rnh lock on error.
# 9b20205d	10-Dec-2008	Kip Macy <kmacy@FreeBSD.org>	fix a reported panic when adding a route and one hit here when deleting a route - pass RTF_RNH_LOCKED to rtalloc1_fib in 2 cases where the lock is held - make sure the rnh lock is held across rt_setgate and rt_getifa_fib
# 609ff41f	07-Dec-2008	Warner Losh <imp@FreeBSD.org>	Add missing include to sys/lock.h before sys/rwlock.h
# 3120b9d4	07-Dec-2008	Kip Macy <kmacy@FreeBSD.org>	- convert radix node head lock from mutex to rwlock - make radix node head lock not recursive - fix LOR in rtexpunge - fix LOR in rtredirect Reviewed by: sam
# 4b79449e	02-Dec-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Rather than using hidden includes (with circular dependencies), directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files. For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h. Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation
# 413628a7	29-Nov-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addition to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the aforementioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels.
- My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
# 1ede983c	23-Oct-2008	Dag-Erling Smørgrav <des@FreeBSD.org>	Retire the MALLOC and FREE macros. They are an abomination unto style(9). MFC after: 3 months
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 8b615593	02-Oct-2008	Marko Zec <zec@FreeBSD.org>	Step 1.5 of importing the network stack virtualization infrastructure from the vimage project, as per plan established at devsummit 08/08: http://wiki.freebsd.org/Image/Notes200808DevSummit Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator macros, and CURVNET_SET() context setting macros, all currently resolving to NOPs. Prepare for virtualization of selected SYSCTL objects by introducing a family of SYSCTL_V_*() macros, currently resolving to their global counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT(). Move selected #defines from sys/sys/vimage.h to newly introduced header files specific to virtualized subsystems (sys/net/vnet.h, sys/netinet/vinet.h etc.). All the changes are verified to have zero functional impact at this point in time by doing MD5 comparison between pre- and post-change object files(*). (*) netipsec/keysock.c did not validate depending on compile time options. Implemented by: julian, bz, brooks, zec Reviewed by: julian, bz, brooks, kris, rwatson, ... Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 603724d3	17-Aug-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Commit step 1 of the vimage project, (network stack) virtualization work done by Marko Zec (zec@). This is the first in a series of commits over the course of the next few weeks. Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only. We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again. Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch
# 4d896055	09-Jul-2008	Robert Watson <rwatson@FreeBSD.org>	Remove unused support for local and foreign addresses in generic raw socket support. These utility routines are used only for routing and pfkey sockets, neither of which have a notion of address, so were required to mock up fake socket addresses to avoid connection requirements for applications that did not specify their own fake addresses (most of them). Quite a bit of the removed code is #ifdef notdef, since raw sockets don't support bind() or connect() in practice. Removing this simplifies the raw socket implementation, and removes two (commented out) uses of dtom(9). Fake addresses passed to sendto(2) by applications are ignored for compatibility reasons, but this is now done in a more consistent way (and with a comment). Possibly, EINVAL could be returned here in the future if it is determined that no applications depend on the semantic inconsistency of specifying a destination address for a protocol without address support, but this will require some amount of careful surveying. NB: This does not affect netinet, netinet6, or other wire protocol raw sockets, which provide their own independent infrastructure with control block address support specific to the protocol. MFC after: 3 weeks Reviewed by: bz
# 59dd72d0	03-Jul-2008	Robert Watson <rwatson@FreeBSD.org>	Remove NETISR_MPSAFE, which allows specific netisr handlers to be directly dispatched without Giant, and add NETISR_FORCEQUEUE, which allows specific netisr handlers to always be dispatched via a queue (deferred). Mark the usb and if_ppp netisr handlers as NETISR_FORCEQUEUE, and explicitly acquire Giant in those handlers. Previously, any netisr handler not marked NETISR_MPSAFE would necessarily run deferred and with Giant acquired. This change removes Giant scaffolding from the netisr infrastructure, but NETISR_FORCEQUEUE allows non-MPSAFE handlers to continue to force deferred dispatch so as to avoid lock order reversals between their acquisition of Giant and any calling context. It is likely we will be able to remove NETISR_FORCEQUEUE once IFF_NEEDSGIANT is removed, as non-MPSAFE usb and if_ppp drivers will no longer be supported. Reviewed by: bz MFC after: 1 month X-MFC note: We can't remove NETISR_MPSAFE from stable/7 for KPI reasons, but the rest can go back.
# 8b07e49a	09-May-2008	Julian Elischer <julian@FreeBSD.org>	Add code to allow the system to handle multiple routing tables. This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x) Currently the only protocol that can make use of the multiple tables is IPv4 Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x) , but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as a EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). 
The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPV4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (One per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extend that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X] where, for all protocol families except ipv4, Y is always 0. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the Address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4 forcing the action to row 0 or to the appropriate row if it IS IPv4 (and that info is available).
These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us as to how the correct FIB is selected for an outgoing IPV4 packet. Firstly, all packets have a FIB associated with them. if nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility call setfib that acts a bit like nice.. setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0, (or possibly a number settable in a sysctl(not yet)). but prior to routing the firewall can inspect them (see below). (possibly in the future you may be able to associate a FIB with packets received on an interface.. An ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. 
A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source. (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being responded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect with the process that set up the tunnel. thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition Netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, It can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routed accordingly. ipfw has grown 2 new keywords: setfib N ip from any to any count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity.
I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. It will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. This will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing to the data. Instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods this reached would be called. The methods would have an argument that gives FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibility of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the destination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco
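The commit message above says that a socket's FIB is inherited from the process "but can be changed by a socket option", without naming the option. The sketch below assumes the option is SO_SETFIB at level SOL_SOCKET, as found in later FreeBSD releases; the helper name set_socket_fib() is hypothetical. On systems that do not define SO_SETFIB the helper simply fails with ENOPROTOOPT.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: ask the kernel to associate socket `s` with
 * routing table `fib`. Returns 0 on success, -1 with errno set on
 * failure. SO_SETFIB is an assumption; the commit text does not name
 * the socket option.
 */
int
set_socket_fib(int s, int fib)
{
#ifdef SO_SETFIB
	return (setsockopt(s, SOL_SOCKET, SO_SETFIB, &fib, sizeof(fib)));
#else
	/* No per-socket FIB support on this platform. */
	(void)s;
	(void)fib;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

A caller would typically invoke this right after socket(2) and before connect(2), so that the choice of local address is made against the intended table.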
# e440aed9	12-Apr-2008	Qing Li <qingli@FreeBSD.org>	This patch provides the back end support for equal-cost multi-path (ECMP) for both IPv4 and IPv6. Previously, multipath route insertion was disallowed. For example, route add -net 192.103.54.0/24 10.9.44.1 route add -net 192.103.54.0/24 10.9.44.2 The second route insertion will trigger an error message of "add net 192.103.54.0/24: gateway 10.2.5.2: route already in table" Multiple default routes can also be inserted. Here is the netstat output: default 10.2.5.1 UGS 0 3074 bge0 => default 10.2.5.2 UGS 0 0 bge0 When multipath routes exist, the "route delete" command requires a specific gateway to be specified or else an error message would be displayed. For example, route delete default would fail and trigger the following error message: "route: writing to routing socket: No such process" "delete net default: not in table" On the other hand, route delete default 10.2.5.2 would be successful: "delete net default: gateway 10.2.5.2" One does not have to specify a gateway if there is only a single route for a particular destination. I need to perform more testing on address aliases and multiple interfaces that have the same IP prefixes. This patch as it stands today is not yet ready for prime time. Therefore, the ECMP code fragments are fully guarded by the RADIX_MPATH macro. Include the "options RADIX_MPATH" in the kernel configuration to enable this feature. Reviewed by: robert, sam, gnn, julian, kmacy
# 237fdd78	16-Mar-2008	Robert Watson <rwatson@FreeBSD.org>	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink
# 18b6e4c8	08-Sep-2007	Olivier Houchard <cognet@FreeBSD.org>	Do not set the RTF_GATEWAY flag if RTF_LLINFO is set, it doesn't make much sense in that context, and leads to unusable routes. This should unbreak bootpd. Discussed with: glebius Submitted by: bms Approved by: re (bmah)
# 5de55821	27-Mar-2007	Gleb Smirnoff <glebius@FreeBSD.org>	Fix regression in rev. 1.140. Reported by: Yuriy Tsibizov <Yuriy.Tsibizov gfk.ru>, bsam
# 75ae0c01	27-Mar-2007	Bruce M Simpson <bms@FreeBSD.org>	Fix a case where hardware removal of an interface caused an attempt to announce an ll_ifma which has gone away. Add a KASSERT to catch regressions. Bug found by: Tom Uffner
# 9406b274	22-Mar-2007	Gleb Smirnoff <glebius@FreeBSD.org>	When working on an RTM_CHANGE do the route editing in the following sequence. First, if rt_ifa is going to be changed, then call ifa_rtrequest(RTM_DELETE). Second, if gateway is going to be changed, then call rt_setgate(). Third, change rt_ifa. With this change we are able to change a link level route to a gateway one, that wasn't possible before: # ifconfig em0 192.168.22.1/24 # arp -s 192.168.22.99 00:11:22:33:44:55 # route change 192.168.22.99 192.168.22.199 # ping 192.168.22.99 db> Reported by: avatar
# acd3428b	06-Nov-2006	Robert Watson <rwatson@FreeBSD.org>	Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking. Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>
# f8829a4a	03-Nov-2006	Randall Stewart <rrs@FreeBSD.org>	Ok, here it is, we finally add SCTP to current. Note that this work is not just mine, but it is also the work of Peter Lei and Michael Tuexen. They both are my two key other developers working on the project.. and they need ata-boy's too: ** peterlei@cisco.com tuexen@fh-muenster.de ** I did do a make sysent which updated the syscall's and sysproto.. I hope that is correct... without it you don't build since we have new syscalls for SCTP :-0 So go out and look at the NOTES, add option SCTP (make sure inet and inet6 are present too) and play with SCTP. I will see about committing some test tools I have after I figure out where I should place them. I also have a lib (libsctp.a) that adds some of the missing socketapi functions that I need to put into lib's.. I will talk to George about this :-) There may still be some 64 bit issues in here, none of us have a 64 bit processor to test with yet.. Michael may have a MAC but that's another beast too.. If you have a mac and want to use SCTP contact Michael he maintains a web site with a loadable module with this code :-) Reviewed by: gnn Approved by: gnn
# a152f8a3	21-Jul-2006	Robert Watson <rwatson@FreeBSD.org>	Change semantics of socket close and detach. Add a new protocol switch function, pru_close, to notify protocols that the file descriptor or other consumer of a socket is closing the socket. pru_abort is now a notification of close also, and no longer detaches. pru_detach is no longer used to notify of close, and will be called during socket tear-down by sofree() when all references to a socket evaporate after an earlier call to abort or close the socket. This means detach is now an unconditional teardown of a socket, whereas previously sockets could persist after detach of the protocol retained a reference. This facilitates sharing mutexes between layers of the network stack as the mutex is required during the checking and removal of references at the head of sofree(). With this change, pru_detach can now assume that the mutex will no longer be required by the socket layer after completion, whereas before this was not necessarily true. Reviewed by: gnn
# e27c3f48	05-Jul-2006	Oleg Bulyzhin <oleg@FreeBSD.org>	Adjust rt_(set|get)metrics() to do kernel <-> userland timebase conversion. We need it since kernel timebase has changed (time_second -> time_uptime). Approved by: glebius (mentor)
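The timebase conversion mentioned in e27c3f48 can be sketched as a pair of pure functions. This is an illustration under assumptions, not the committed code: the function names are hypothetical, an expiry of 0 is assumed to mean "never expires", and the kernel is assumed to store expiry times relative to uptime while userland expects wall-clock seconds.

```c
#include <time.h>

/*
 * Hypothetical helper: convert an expiry stored on the kernel's
 * uptime clock into wall-clock seconds for userland.
 * 0 is passed through unchanged (assumed to mean "no expiry").
 */
time_t
expire_to_userland(time_t expire_uptime, time_t uptime_now, time_t wall_now)
{
	if (expire_uptime == 0)
		return (0);
	return (expire_uptime - uptime_now + wall_now);
}

/*
 * Hypothetical helper: the inverse conversion, applied when userland
 * supplies a wall-clock expiry that the kernel keeps on the uptime clock.
 */
time_t
expire_to_kernel(time_t expire_wall, time_t uptime_now, time_t wall_now)
{
	if (expire_wall == 0)
		return (0);
	return (expire_wall - wall_now + uptime_now);
}
```

The two helpers are inverses for any fixed pair of clock readings, which is the property a get/set round trip through the routing socket relies on.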
# bc725eaf	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Change protocol switch method pru_detach() so that it returns void rather than an error. Detaches do not "fail", they either occur or the protocol flags SS_PROTOREF to take ownership of the socket. soclose() no longer looks at so_pcb to see if it's NULL, relying entirely on the protocol to decide whether it's time to free the socket or not using SS_PROTOREF. so_pcb is now entirely owned and managed by the protocol code. Likewise, no longer test so_pcb in other socket functions, such as soreceive(), which have no business digging into protocol internals. Protocol detach routines no longer try to free the socket on detach, this is performed in the socket code if the protocol permits it. In rts_detach(), no longer test for rp != NULL in detach, and likewise in other protocols that don't permit a NULL so_pcb, reduce the incidence of testing for it during detach. netinet and netinet6 are not fully updated to this change, which will be in an upcoming commit. In their current state they may leak memory or panic. MFC after: 3 months
# ac45e92f	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Change protocol switch pru_abort() API so that it returns void rather than an int, as an error here is not meaningful. Modify soabort() to unconditionally free the socket on the return of pru_abort(), and modify most protocols to no longer conditionally free the socket, since the caller will do this. This commit likely leaves parts of netinet and netinet6 in a situation where they may panic or leak memory, as they are not fully updated by this commit. This will be corrected shortly in followup commits to these components. MFC after: 3 months
# 22cafcf0	15-Mar-2006	Andre Oppermann <andre@FreeBSD.org>	- Fill in the correct rtm_index for RTM_ADD and RTM_CHANGE messages. - Allow RTM_CHANGE to change a number of route flags as specified by RTF_FMASK. - The unused rtm_use field in struct rt_msghdr is redesignated as rtm_fmask field to communicate route flag changes in RTM_CHANGE messages from userland. The use count of a route was moved to rtm_rmx a long time ago. For source code compatibility reasons a define of rtm_use to rtm_fmask is provided. These changes facilitate running of multiple cooperating routing daemons at the same time without causing undesired interference. Open[BGP|OSPF]D make use of these features to have IGP routes override EGP ones. Obtained from: OpenBSD (claudio@) MFC after: 3 days
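The flag-change mechanism described in 22cafcf0 amounts to a masked update: only the bits selected by a mask are taken from the request, and every other bit keeps its current value. The sketch below shows that general technique; the function name is hypothetical and the exact way the kernel combines rtm_flags, rtm_fmask and RTF_FMASK is not stated in the log, so this should not be read as the committed logic.

```c
#include <stdint.h>

/*
 * Hypothetical helper illustrating a masked flag update:
 * bits set in `fmask` are copied from `requested`,
 * all other bits are preserved from `current`.
 */
uint32_t
apply_flag_change(uint32_t current, uint32_t requested, uint32_t fmask)
{
	return ((current & ~fmask) | (requested & fmask));
}
```

With an empty mask the route's flags are left untouched, which is what lets cooperating daemons change only the bits they own.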
# 4a0d6638	11-Nov-2005	Ruslan Ermilov <ru@FreeBSD.org>	- Store pointer to the link-level address right in "struct ifnet" rather than in ifindex_table[]; all (except one) accesses are through ifp anyway. IF_LLADDR() works faster, and all (except one) ifaddr_byindex() users were converted to use ifp->if_addr. - Stop storing a (pointer to) Ethernet address in "struct arpcom", and drop the IFP2ENADDR() macro; all users have been converted to use IF_LLADDR() instead.
# 303989a2	09-Nov-2005	Ruslan Ermilov <ru@FreeBSD.org>	Use sparse initializers for "struct domain" and "struct protosw", so they are easier to follow for the human being.
# a11faa9f	19-Sep-2005	Gleb Smirnoff <glebius@FreeBSD.org>	Drop current rtentry lock before calling rt_getifa(). This fixes a LOR and a possible recursive use of rtentry mutex. PR: kern/69356 Reviewed by: sam
# fe0fc7ef	10-Sep-2005	Christian S.J. Peron <csjp@FreeBSD.org>	Protect interface and address lists using the appropriate mutex. These locks were not acquired because the user buffers were not wired, thus it was possible that SYSCTL_OUT could sleep, causing a number of different problems such as lock ordering issues and deadlocks. -Wire user supplied buffer to ensure SYSCTL_OUT will not sleep. -Pickup ifnet locks to protect the list. -Where applicable pickup address locks. -Pickup radix node head locks. -Remove splnet stubs -Remove various comments about locking here, because they are no longer needed. It is the hope that these changes will make sysctl_rtsock MP safe. MFC after: 3 weeks
# 5b1c0294	07-Sep-2005	David E. O'Brien <obrien@FreeBSD.org>	Forward declaring static variables as extern is invalid ISO-C. Now that GCC can properly handle forward static declarations, do this properly.
# 7e994955	25-Aug-2005	Robert Watson <rwatson@FreeBSD.org>	De-spl parts of the routing socket code now generally protected through locking; leave some spl references around code where there are open questions about global variable references. Also, add an XXX regarding locking in sysctl. MFC after: 3 days
# 79188861	11-Aug-2005	Gleb Smirnoff <glebius@FreeBSD.org>	o To prevent a race between RTM_DELETE message and arptimer() deleting stale entry, we need to lock rtentry before unlocking radix head. Reviewed by: sam
# 292ee7be	09-Aug-2005	Robert Watson <rwatson@FreeBSD.org>	Rename IFF_RUNNING to IFF_DRV_RUNNING, IFF_OACTIVE to IFF_DRV_OACTIVE, and move both flags from ifnet.if_flags to ifnet.if_drv_flags, making and documenting the locking of these flags the responsibility of the device driver, not the network stack. The flags for these two fields will be mutually exclusive so that they can be exposed to user space as though they were stored in the same variable. Provide #defines to provide the old names #ifndef _KERNEL, so that user applications (such as ifconfig) can use the old flag names. Using the old names in a device driver will result in a compile error in order to help device driver writers adopt the new model. When exposing the interface flags to user space, via interface ioctls or routing sockets, bitwise-OR the two fields together. Since the driver flags cannot currently be set for user space, no new logic is currently required to handle this case. Add some assertions that general purpose network stack routines, such as if_setflags(), are not improperly used on driver-owned flags. With this change, a large number of very minor network stack races are closed, subject to correct device driver locking. Most were likely never triggered. Driver sweep to follow; many thanks to pjd and bz for the line-by-line review they gave this patch. Reviewed by: pjd, bz MFC after: 7 days
# ba7be0a9	15-Jul-2005	George V. Neville-Neil <gnn@FreeBSD.org>	Fix for PR 82974. We were not checking that the route looked up in the case of an RTM_CHANGE was specific, i.e. that it matched completely. This led to a route change of a non-existent route changing the default route as the radix code would simply back track to that point and hand that route back to the routing socket code. PR: 82974 Reviewed by: Tai-hwa Liang <avatar@mmlab.cse.yzu.edu.tw> Ben Kaduk <minimarmot@gmail.com> Bjoern A. Zeeb <bzeeb-lists@lists.zabbadoz.net> Obtained from: OpenBSD with modifications. MFC after: 2 weeks
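The bug fixed in ba7be0a9 is easiest to see with a toy table. A longest-prefix lookup for a destination that has no route of its own falls back to a shorter prefix (here the default route), so a change request must additionally verify that the hit matches the requested destination and prefix length exactly. Everything below is a hypothetical model with invented names; the kernel uses a radix tree and sockaddr-based keys, not this array.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy IPv4 route: prefix in host byte order plus prefix length (0..32). */
struct toy_route {
	uint32_t prefix;
	uint8_t  plen;
};

/* Convert a prefix length (assumed <= 32) into a netmask. */
static uint32_t
plen_to_mask(uint8_t plen)
{
	return (plen == 0 ? 0 : 0xffffffffu << (32 - plen));
}

/* Longest-prefix match; may back off to a shorter prefix such as 0/0. */
const struct toy_route *
toy_lookup(const struct toy_route *tbl, size_t n, uint32_t dst)
{
	const struct toy_route *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if ((dst & plen_to_mask(tbl[i].plen)) != tbl[i].prefix)
			continue;
		if (best == NULL || tbl[i].plen > best->plen)
			best = &tbl[i];
	}
	return (best);
}

/* A change is honoured only when the hit is the exact route requested. */
int
toy_change_allowed(const struct toy_route *hit, uint32_t dst, uint8_t plen)
{
	return (hit != NULL && hit->prefix == dst && hit->plen == plen);
}
```

In this model, a change request for 192.168.1.0/24 against a table holding only 0/0 and 10/8 finds the default route but is rejected by the exactness check, which mirrors the behaviour the commit describes.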
# 25029d6c	08-Jun-2005	Hartmut Brandt <harti@FreeBSD.org>	When returning an RTM_GET message through the routing socket fill in the rtm_index field whenever we have an interface pointer. This is consistent with the RTM_GET messages returned by sysctl().
# 7a7fa27b	26-Mar-2005	Sam Leffler <sam@FreeBSD.org>	rt_newaddrmsg will blow up if given something other than RTM_ADD or RTM_DELETE; add an assertion, may want to do something more heavyhanded in the future Noticed by: Coverity Prevent analysis tool Reviewed by: mdodd
# 8d78bea4	23-Feb-2005	Sam Leffler <sam@FreeBSD.org>	eliminate dead code and collapse the remainder Noticed by: Coverity Prevent analysis tool Reviewed by: rwatson
# c398230b	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for license, minor formatting changes
# 756d52a1	08-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Initialize struct pr_userreqs in new/sparse style and fill in common default elements in net_init_domain(). This makes it possible to grep these structures and see any bogosities.
# b83a279f	05-Oct-2004	Sam Leffler <sam@FreeBSD.org>	Add 802.11-specific events that are dispatched through the routing socket. This really doesn't belong here but is preferred (for the moment) over adding yet another mechanism for sending msgs from the kernel to user apps. Reviewed by: imp
# 3161f583	27-Aug-2004	Andre Oppermann <andre@FreeBSD.org>	Apply error and success logic consistently to the function netisr_queue() and its users. netisr_queue() now returns (0) on success and ERRNO on failure. At the moment ENXIO (netisr queue not functional) and ENOBUFS (netisr queue full) are supported. Previously it would return (1) on success but the return value of IF_HANDOFF() was interpreted wrongly and (0) was actually returned on success. Due to this schednetisr() was never called to kick the scheduling of the isr. However this was masked by other normal packets coming through netisr_dispatch() causing the dequeueing of waiting packets. PR: kern/70988 Found by: MOROHOSHI Akihiko <moro@remus.dti.ne.jp> MFC after: 3 days
# 18aee723	24-Aug-2004	Peter Pentchev <roam@FreeBSD.org>	Fix a typo (attacked -> attached). Approved by: sam
# b062951a	21-Aug-2004	Robert Watson <rwatson@FreeBSD.org>	If a tunable for the routing socket netisr queue max is defined, allow it to override the default value, rather than the default value overriding the tunable.
# 190a4c94	21-Aug-2004	Robert Watson <rwatson@FreeBSD.org>	Allow the size of the routing socket netisr queue to be configured using the tunable or sysctl 'net.route.netisr_maxqlen'. Default the maximum depth to 256 rather than IFQ_MAXLEN due to the downsides of dropping routing messages. MT5 candidate. Discussed with: mdodd, mlaier, Vincent Jardin <jardin at 6wind.com>
# 3b7d076f	13-Aug-2004	Robert Watson <rwatson@FreeBSD.org>	Use IFQ_SET_MAXLEN() to set the maximum queue depth of the routing socket netisr queue. Pointed out by: winter
# 9b3d77e7	05-Jul-2004	Bruce M Simpson <bms@FreeBSD.org>	Be consistent and use bzero() instead of memset().
# d989c7b3	08-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Introduce a netisr to deliver kernel-generated routing, avoiding recursive entering of the socket code from the routing code: - Modify rt_dispatch() to bundle up the sockaddr family, if any, associated with a pending mbuf to dispatch to routing sockets, in an m_tag on the mbuf. - Allocate NETISR_ROUTE for use by routing sockets. - Introduce rtsintrq, an ifqueue to be used by the netisr, and introduce rts_input(), a function to unbundle the tagged sockaddr and inject the mbuf and address into raw_input(), which previously occurred in rt_dispatch(). - Introduce rts_init() to initialize rtsintrq, its mutex, and register the netisr. Perform this at the same point in system initialization as setup of the domains. This change introduces asynchrony between the generation of a pending routing socket message and delivery to sockets for use by userspace. It avoids socket->routing->rtsock->socket use and helps to avoid lock order reversals between the routing code and socket code (in particular, raw socket control blocks), as route locks are held over calls to rt_dispatch(). Reviewed by: "George V.Neville-Neil" <gnn@neville-neil.com> Conceptual head nod by: sam
# 3581cc66	10-May-2004	Christian S.J. Peron <csjp@FreeBSD.org>	Zero the un-used portions of the struct sockaddr data before sending it back to userspace, so it does not break bind(2) on raw sockets in jails. Currently some processes, like traceroute(8) construct a routing request to determine its source address based on the destination. This sockaddr data is fed directly to bind(2). When bind calls ifa_ifwithaddr(9) to make sure the address exists on the interface, the comparison will fail causing bind(2) to return EADDRNOTAVAIL if the data wasn't zeroed before initialization. Approved by: bmilekic (mentor)
# 1a0c4873	03-May-2004	Maxim Konovalov <maxim@FreeBSD.org>	o Fix misindentation in the previous commit.
# 5a59cefc	26-Apr-2004	Bosko Milekic <bmilekic@FreeBSD.org>	Give jail(8) the feature to allow raw sockets from within a jail, which is less restrictive but allows for more flexible jail usage (for those who are willing to make the sacrifice). The default is off, but allowing raw sockets within jails can now be accomplished by tuning security.jail.allow_raw_sockets to 1. Turning this on will allow you to use things like ping(8) or traceroute(8) from within a jail. The patch being committed is not identical to the patch in the PR. The committed version is more friendly to APIs which pjd is working on, so it should integrate into his work quite nicely. This change has also been presented and addressed on the freebsd-hackers mailing list. Submitted by: Christian S.J. Peron <maneo@bsdpro.com> PR: kern/65800
# 9554c70b	19-Apr-2004	Ruslan Ermilov <ru@FreeBSD.org>	More style and deobfuscation fixes. Submitted by: bde
# ae24a36e	18-Apr-2004	Ruslan Ermilov <ru@FreeBSD.org>	Style and code unobfuscation.
# b088717c	18-Apr-2004	Ruslan Ermilov <ru@FreeBSD.org>	Fixed a bug from rev. 1.42: cast to a correct type. Submitted by: luigi
# 6b96f1af	18-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	+ replace Bcmp/Bzero with 'the real thing' as in the rest of the file. + remember to check and fix or explain a strange cast in route_output()
# 5dfc91d7	17-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	Minor changes to improve code readability (no actual code changes): + replace 0 with NULL where appropriate (not complete) + remove register declaration while there + add argument names to function prototypes to have a better idea of what they are used for + add 'const' qualifiers in 3 places
# 913af518	17-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	misc cleanup in sysctl_ifmalist(): + remove a partly incorrect comment that i introduced in the last commit; + deal with the correct part of the above comment by cleaning up the updates of 'info' -- rti_addrs need not be updated, rti_info[RTAX_IFP] can be set once outside the loop. While at it, correct a few misspellings of NULL as 0, but there are way too many in this file, and i did not want to clutter the important part of this commit.
# 9b98ee2c	16-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	Consistently use ifaddr_byindex() to access the link-level address of an interface. No functional change. On passing, comment a likely bug in net/rtsock.c:sysctl_ifmalist() which, if confirmed, would deserve to be fixed and MFC'ed
# e74642df	13-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	route.h: introduce a macro, SA_SIZE(struct sockaddr *) which returns the space occupied by a struct sockaddr when passed through a routing socket. Use it to replace the macro ROUNDUP(int), that does the same but is redefined by every file which uses it, courtesy of the School of Cut'n'Paste Programming(TM). (partial) userland changes to follow.
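The log says SA_SIZE() "returns the space occupied by a struct sockaddr when passed through a routing socket" and that it replaces the per-file ROUNDUP(int) macro. The sketch below assumes the usual rule for such macros: round the sockaddr length up to a multiple of sizeof(long), and treat a zero length as occupying one long. The function name and the choice to take the length directly, rather than a struct sockaddr pointer, are illustrative only.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the rounding rule: given a sockaddr
 * length `sa_len`, return the number of bytes it is assumed to
 * occupy in a routing message (a multiple of sizeof(long)).
 */
size_t
sa_size_sketch(unsigned char sa_len)
{
	if (sa_len == 0)
		return (sizeof(long));
	/* Setting the low bits of (len - 1) and adding 1 rounds up. */
	return (1 + (((size_t)sa_len - 1) | (sizeof(long) - 1)));
}
```

A reader walking the addresses that follow a routing message header would advance its cursor by this amount after each sockaddr.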
# a8b76c8f	12-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	remove an almost-duplicate piece of code by setting the loop limits appropriately.
# f36cfd49	07-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson
# 97d8d152	20-Nov-2003	Andre Oppermann <andre@FreeBSD.org>	Introduce tcp_hostcache and remove the tcp specific metrics from the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)
# a557af22	17-Nov-2003	Robert Watson <rwatson@FreeBSD.org>	Introduce a MAC label reference in 'struct inpcb', which caches the MAC label referenced from 'struct socket' in the IPv4 and IPv6-based protocols. This permits MAC labels to be checked during network delivery operations without dereferencing inp->inp_socket to get to so->so_label, which will eventually avoid our having to grab the socket lock during delivery at the network layer. This change introduces 'struct inpcb' as a labeled object to the MAC Framework, along with the normal circus of entry points: initialization, creation from socket, destruction, as well as a delivery access control check. For most policies, the inpcb label will simply be a cache of the socket label, so a new protocol switch method is introduced, pr_sosetlabel() to notify protocols that the socket layer label has been updated so that the cache can be updated while holding appropriate locks. Most protocols implement this using pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use the the worker function in_pcbsosetlabel(), which calls into the MAC Framework to perform a cache update. Biba, LOMAC, and MLS implement these entry points, as do the stub policy, and test policy. Reviewed by: sam, bms Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 05b2efe0	14-Nov-2003	Bruce M Simpson <bms@FreeBSD.org>	Add a sysctl MIB, NET_RT_IFMALIST, to retrieve multicast group memberships in a protocol-independent way. Submitted by: harti
# 7138d65c	08-Nov-2003	Sam Leffler <sam@FreeBSD.org>	replace explicit changes to rt_refcnt by RT_ADDREF and RT_REMREF macros that expand to include assertions when the system is built with INVARIANTS Supported by: FreeBSD Foundation
# 9bf40ede	31-Oct-2003	Brooks Davis <brooks@FreeBSD.org>	Replace the if_name and if_unit members of struct ifnet with new members if_xname, if_dname, and if_dunit. if_xname is the name of the interface and if_dname/unit are the driver name and instance. This change paves the way for interface renaming and enhanced pseudo device creation and configuration semantics. Approved By: re (in principle) Reviewed By: njl, imp Tested On: i386, amd64, sparc64 Obtained From: NetBSD (if_xname)
# d1dd20be	03-Oct-2003	Sam Leffler <sam@FreeBSD.org>	Locking for updates to routing table entries. Each rtentry gets a mutex that covers updates to the contents. Note this is separate from holding a reference and/or locking the routing table itself. Other/related changes: o rtredirect loses the final parameter by which an rtentry reference may be returned; this was never used and added unwarranted complexity for locking. o minor style cleanups to routing code (e.g. ansi-fy function decls) o remove the logic to bump the refcnt on the parent of cloned routes, we assume the parent will remain as long as the clone; doing this avoids a circularity in locking during delete o convert some timeouts to MPSAFE callouts Notes: 1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level applications cannot/do not know about mutexes. Doing this requires that the mutex be the last element in the structure. A better solution is to introduce an externalized version of struct rtentry but this is a major task because of the intertwining of rtentry and other data structures that are visible to user applications. 2. There are known LOR's that are expected to go away with forthcoming work to eliminate many held references. If not these will be resolved prior to release. 3. ATM changes are untested. Sponsored by: FreeBSD Foundation Obtained from: BSD/OS (partly)
# aea8b30f	03-Oct-2003	Sam Leffler <sam@FreeBSD.org>	trivial locking rtsock_cb Sponsored by: FreeBSD Foundation
# becc44d7	03-Oct-2003	Sam Leffler <sam@FreeBSD.org>	cleanups prior to adding locking (and in some cases to eliminate locking): o move route_cb to be private to rtsock.c o replace global static route_proto by locals o eliminate global #define shorthands for info references o remove some register decls o ansi-fy function decls o move items to be close in scope to their usage o add rt_dispatch function for dispatching the actual message o cleanup tangled logic for doing all-but-me msg send Support by: FreeBSD Foundation
# 3c6b084e	05-Mar-2003	Peter Wemm <peter@FreeBSD.org>	Finish driving a stake through the heart of netns and the associated ifdefs scattered around the place - it's dead, Jim! The SMB stuff had stolen AF_NS, make it official.
# a163d034	18-Feb-2003	Warner Losh <imp@FreeBSD.org>	Back out M_* changes, per decision of the TRB. Approved by: trb
# 44956c98	21-Jan-2003	Alfred Perlstein <alfred@FreeBSD.org>	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
# 7701e15b	25-Dec-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Disable radix node locking for sysctl until we fix the sysctl infrastructure to not sleep.
# b2aaf46e	25-Dec-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Range-check the address family parameter passed in to the sysctl handler. Submitted by: ru
# 71eba915	25-Dec-2002	Ruslan Ermilov <ru@FreeBSD.org>	If the caller of rtrequest*(RTM_DELETE, ...) asked for a copy of the entry being removed (ret_nrt != NULL), increment the entry's rt_refcnt like we do it for RTM_ADD and RTM_RESOLVE, rather than messing around with 1->0 transitions for rtfree() all over.
# 956b0b65	23-Dec-2002	Jeffrey Hsu <hsu@FreeBSD.org>	SMP locking for radix nodes.
# b30a244c	21-Dec-2002	Jeffrey Hsu <hsu@FreeBSD.org>	SMP locking for ifnet list.
# 12e552d6	20-Dec-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Swap the order of a free and a use of an ifaddr structure.
# 19fc74fb	18-Dec-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Lock up ifaddr reference counts.
# 8d3574c7	01-Oct-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Fix some harmless mis-indents. Spotted by: FlexeLint
# 37c84183	28-Sep-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Be consistent about "static" functions: if the function is marked static in its prototype, mark it static at the definition too. Inspired by: FlexeLint warning #512
# 93b0017f	25-Aug-2002	Philippe Charnier <charnier@FreeBSD.org>	Replace various spelling with FALLTHROUGH which is lint()able
# 62f76486	18-Aug-2002	Maxim Sobolev <sobomax@FreeBSD.org>	Increase size of ifnet.if_flags from 16 bits (short) to 32 bits (int). To avoid breaking application ABI use unused ifreq.ifru_flags[1] for upper 16 bits in SIOCSIFFLAGS and SIOCGIFFLAGS ioctl's. Reviewed by: -hackers, -net
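The ABI trick in 62f76486 keeps the existing 16-bit flags slot for the low half and carries the upper 16 bits in the previously unused ifru_flags[1]. The sketch below shows only the split and join arithmetic on a two-element array; the function names are hypothetical and unsigned fixed-width types are used for clarity, whereas the real ifreq field is an array of short.

```c
#include <stdint.h>

/*
 * Hypothetical helper: split 32-bit interface flags into two 16-bit
 * halves, low half in halves[0], high half in halves[1].
 */
void
flags_split(uint32_t flags, uint16_t halves[2])
{
	halves[0] = (uint16_t)(flags & 0xffffu);
	halves[1] = (uint16_t)(flags >> 16);
}

/* Hypothetical helper: reassemble the 32-bit value from the halves. */
uint32_t
flags_join(const uint16_t halves[2])
{
	return ((uint32_t)halves[0] | ((uint32_t)halves[1] << 16));
}
```

An old binary that only reads the first element still sees the original 16 flag bits, which is why the application ABI is preserved.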
# 03e49181	18-Jun-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Remove so*_locked(), which were backed out by mistake.
# 4cc20ab1	31-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Back out my last commit of locking down a socket, it conflicts with hsu's work. Requested by: hsu
# 243917fe	19-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Lock down a socket, milestone 1. o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket. o Determine the lock strategy for each members in struct socket. o Lock down the following members: - so_count - so_options - so_linger - so_state o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket: - sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup() Reviewed by: alfred
# f1320723	01-May-2002	Alfred Perlstein <alfred@FreeBSD.org>	Redo the sigio locking. Turn the sigio sx into a mutex. Sigio lock is really only needed to protect interrupts from dereferencing the sigio pointer in an object when the sigio itself is being destroyed. In order to do this in the most unintrusive manner change pgsigio's sigio * argument into a **, that way we can lock internally to the function.
# 960ed29c	29-Apr-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Revert the change of #includes in sys/filedesc.h and sys/socketvar.h. Requested by: bde Since locking sigio_lock is usually followed by calling pgsigio(), move the declaration of sigio_lock and the definitions of SIGIO_*() to sys/signalvar.h. While I am here, sort include files alphabetically, where possible.
# d48d4b25	27-Apr-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Add a global sx sigio_lock to protect the pointer to the sigio object of a socket. This avoids lock order reversal caused by locking a process in pgsigio(). sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now require sigio_lock to be locked. Provide sowwakeup_locked(), soisconnected_locked(), and so on in case where we have to modify a socket and wake up a process atomically.
# 44731cab	01-Apr-2002	John Baldwin <jhb@FreeBSD.org>	Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag. Discussed on: smp@
# 929ddbbb	19-Mar-2002	Alfred Perlstein <alfred@FreeBSD.org>	Remove __P.
# a854ed98	27-Feb-2002	John Baldwin <jhb@FreeBSD.org>	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.
# 694ff264	27-Jan-2002	Andrew Gallatin <gallatin@FreeBSD.org>	Prevent the kernel from generating an unaligned sysctl data buffer on 64-bit platforms. The unaligned access is caused by struct ifa_msghdr not being a multiple of 8-bytes in size. If an interface has an odd number of addresses, this causes the next interface to generate an unaligned access in the user-level app walking the interfaces (ifconfig). Submitted by: Bernd Walter <ticso@cicely8.cicely.de>
# f7a54d06	24-Jan-2002	Crist J. Clark <cjc@FreeBSD.org>	Have sysctl() return the correct errno(2) as documented in the sysctl(3) manpage. Submitted by: ru Obtained from: BSD/OS
# 7b6edd04	18-Jan-2002	Ruslan Ermilov <ru@FreeBSD.org>	Introduce an interface announcement message for the routing socket so that routing daemons and other interested parties know when an interface is attached/detached. PR: kern/33747 Obtained from: NetBSD MFC after: 2 weeks
# e20e9426	19-Dec-2001	Brian Somers <brian@FreeBSD.org>	It's no longer necessary to ensure that ``gate'' is set when RTF_GATEWAY is passed, as subsequent code does that check now anyway. Submitted by: ru
# 02a5d63e	19-Dec-2001	Brian Somers <brian@FreeBSD.org>	Only call rt_getifa() if we've either been passed a gateway or if we've been given an RTA_IFP or changed RTA_IFA sockaddr. This fixes the following bug: >/dev/tun100 >/dev/tun101 ifconfig tun100 1.2.3.4 5.6.7.8 ifconfig tun101 1.2.3.4 6.7.8.9 route change 6.7.8.9 -ifa 1.2.3.4 -iface -mtu 500 which erroneously changed tun101's host route to have an ifp of tun100 (rt_getifa() sets the ifp after calling ifa_ifwithnet(1.2.3.4)) This incarnation submitted by: ru
# 8071913d	17-Oct-2001	Ruslan Ermilov <ru@FreeBSD.org>	Pull post-4.4BSD change to sys/net/route.c from BSD/OS 4.2. Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *'' as the argument. Pass rt_addrinfo all the way down to rtrequest1 and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now ``rt_addrinfo *'' instead of ``sockaddr *'' (almost no one is using it anyways). Benefit: the following command now works. Previously we needed two route(8) invocations, "add" then "change". # route add -inet6 default ::1 -ifp gif0 Remove unsafe typecast in rtrequest(), from ``rtentry *'' to ``sockaddr *''. It was introduced by 4.3BSD-Reno and never corrected. Obtained from: BSD/OS, NetBSD MFC after: 1 month PR: kern/28360
# 28070a0e	17-Oct-2001	Ruslan Ermilov <ru@FreeBSD.org>	Bring in latest CSRG revisions to this file: - Report destination address of a P2P link when servicing routing socket messages. - Report interface name, address, and destination address of a P2P link when servicing NET_RT_{DUMP,FLAGS} sysctls. Part of CSRG revision 8.6 corresponds to revision 1.12. CSRG revision 8.7 corresponds to revision 1.15.
# a35b06c5	28-Sep-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Change sysctl_iflist() so it has a single point of return. This will assist any future locking efforts.
# dadb6c3b	20-Sep-2001	Ruslan Ermilov <ru@FreeBSD.org>	Use the current process's credentials rather than socket's cached. If the process drops its super-user privileges, we certainly don't want to allow it to modify routing tables. Discussed with: rwatson
# b40ce416	12-Sep-2001	Julian Elischer <julian@FreeBSD.org>	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# 162c0b2e	30-Aug-2001	Ruslan Ermilov <ru@FreeBSD.org>	Synch with NetBSD and OpenBSD. Allow non-superuser to open, listen to, and send safe commands on the routing socket. Superuser privilege is required for all commands but RTM_GET. Lose `setuid root' bit of route(8). Reviewed by: wollman, dd
# 7ba271ae	02-Aug-2001	Jonathan Chen <jon@FreeBSD.org>	fix memory leak when error during opening of routing socket PR: kern/29336 Submitted by: Richard Andrades <richard@xebeo.com> MFC after: 1 month
# 03311056	04-Jul-2001	Hajimu UMEMOTO <ume@FreeBSD.org>	adjust mbuf length right in route_output(). Obtained from: KAME MFC after: 1 week
# 33841545	10-Jun-2001	Hajimu UMEMOTO <ume@FreeBSD.org>	Sync with recent KAME. This work was based on kame-20010528-freebsd43-snap.tgz and some critical problem after the snap was out were fixed. There are many many changes since last KAME merge. TODO: - The definitions of SADB_* in sys/net/pfkeyv2.h are still different from RFC2407/IANA assignment because of binary compatibility issue. It should be fixed under 5-CURRENT. - ip6po_m member of struct ip6_pktopts is no longer used. But, it is still there because of binary compatibility issue. It should be removed under 5-CURRENT. Reviewed by: itojun Obtained from: KAME MFC after: 3 weeks
# 91421ba2	20-Feb-2001	Robert Watson <rwatson@FreeBSD.org>	o Move per-process jail pointer (p->pr_prison) to inside of the subject credential structure, ucred (cr->cr_prison). o Allow jail inheritance to be a function of credential inheritance. o Abstract prison structure reference counting behind pr_hold() and pr_free(), invoked by the similarly named credential reference management functions, removing this code from per-ABI fork/exit code. o Modify various jail() functions to use struct ucred arguments instead of struct proc arguments. o Introduce jailed() function to determine if a credential is jailed, rather than directly checking pointers all over the place. o Convert PRISON_CHECK() macro to prison_check() function. o Move jail() function prototypes to jail.h. o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the flag in the process flags field itself. o Eliminate that "const" qualifier from suser/p_can/etc to reflect mutex use. Notes: o Some further cleanup of the linux/jail code is still required. o It's now possible to consider resolving some of the process vs credential based permission checking confusion in the socket code. o Mutex protection of struct prison is still not present, and is required to protect the reference count plus some fields in the structure. Reviewed by: freebsd-arch Obtained from: TrustedBSD Project
# fc2ffbe6	04-Feb-2001	Poul-Henning Kamp <phk@FreeBSD.org>	Mechanical change to use <sys/queue.h> macro API instead of fondling implementation details. Created with: sed(1) Reviewed by: md5(1)
# 22f29826	03-Feb-2001	Poul-Henning Kamp <phk@FreeBSD.org>	Use <sys/queue.h> macro api rather than fondle its implementation details. Created with: /usr/bin/sed Reviewed by: /sbin/md5
# 7cc0979f	08-Dec-2000	David Malone <dwmalone@FreeBSD.org>	Convert more malloc+bzero to malloc+M_ZERO. Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>
# 8bf72aef	25-Jul-2000	Hajimu UMEMOTO <ume@FreeBSD.org>	Workaround to avoid panic during detach pccard nic.
# 77978ab8	04-Jul-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Previous commit changing SYSCTL_HANDLER_ARGS violated KNF. Pointed out by: bde
# 82d9ae4e	03-Jul-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Style police catches up with rev 1.26 of src/sys/sys/sysctl.h: Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our sources: -sysctl_vm_zone SYSCTL_HANDLER_ARGS +sysctl_vm_zone (SYSCTL_HANDLER_ARGS)
# 242c5536	12-Feb-2000	Peter Wemm <peter@FreeBSD.org>	Clean up some loose ends in the network code, including the X.25 and ISO #ifdefs. Clean out unused netisr's and leftover netisr linker set gunk. Tested on x86 and alpha, including world. Approved by: jkh
# 899ce4f4	28-Jan-2000	Yoshinobu Inoue <shin@FreeBSD.org>	Count AF_INET6 attachment to routing socket. Obtained from: KAME project
# 920eb79f	28-Dec-1999	Ruslan Ermilov <ru@FreeBSD.org>	Make cloning mask sockaddr (genmask) possible. PR: kern/3061 Reviewed by: wollman
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# cb64988f	28-Apr-1999	Luoqi Chen <luoqi@FreeBSD.org>	Postpone route_init() until all domains are attached.
# 75c13541	28-Apr-1999	Poul-Henning Kamp <phk@FreeBSD.org>	This Implements the mumbled about "Jail" feature. This is a seriously beefed up chroot kind of thing. The process is jailed along the same lines as a chroot does it, but with additional tough restrictions imposed on what the superuser can do. For all I know, it is safe to hand over the root bit inside a prison to the customer living in that prison, this is what it was developed for in fact: "real virtual servers". Each prison has an ip number associated with it, which all IP communications will be coerced to use and each prison has its own hostname. Needless to say, you need more RAM this way, but the advantage is that each customer can run their own particular version of apache and not stomp on the toes of their neighbors. It generally does what one would expect, but setting up a jail still takes a little knowledge. A few notes: I have no scripts for setting up a jail, don't ask me for them. The IP number should be an alias on one of the interfaces. mount a /proc in each jail, it will make ps more useable. /proc/<pid>/status tells the hostname of the prison for jailed processes. Quotas are only sensible if you have a mountpoint per prison. There are no provisions for stopping resource-hogging. Some "#ifdef INET" and similar may be missing (send patches!) If somebody wants to take it from here and develop it into more of a "virtual machine" they should be most welcome! Tools, comments, patches & documentation most welcome. Have fun... Sponsored by: http://www.rndassociates.com/ Run for almost a year by: http://www.servetheweb.com/
# 831a80b0	27-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
# 95b6073c	31-Oct-1997	David Greenman <dg@FreeBSD.org>	Fixed bug in RTM_ADD where rmx_locks weren't being set on the new route, preventing "route add default 1.2.3.4 -lock -mtu 1500" from working as expected (which is, BTW, to disable Path MTU Discovery).
# 55b211e3	28-Oct-1997	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# a1c995b6	12-Oct-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Last major round (Unless Bruce thinks of something :-) of malloc changes. Distribute all but the most fundamental malloc types. This time I also remembered the trick to making things static: Put "static" in front of them. A couple of finer points by: bde
# f8f6cbba	13-Sep-1997	Peter Wemm <peter@FreeBSD.org>	Update network code to use poll support.
# 4d1d4912	01-Sep-1997	Bruce Evans <bde@FreeBSD.org>	Added used #include - don't depend on <sys/mbuf.h> including <sys/malloc.h> (unless we only use the bogusly shared M*WAIT flags).
# 57bf258e	16-Aug-1997	Garrett Wollman <wollman@FreeBSD.org>	Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to not malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
# 076d0761	18-Jul-1997	Julian Elischer <julian@FreeBSD.org>	An actual fix for the routing default crashes that 1/ is compatible with the old route(1) in case needed. 2/ actually fixes the problem while vetting bad user input. note: I have already fixed route(1) so the problem shouldn't occur. if it does. use 0.0.0.0/0 instead of the word 'default' :)
# 9a969a6e	17-Jul-1997	Mike Smith <msmith@FreeBSD.org>	Fix Julian's fixed fix. Routing is weird. We need to accept at least one sockaddr with zero length, in order to be able to set the default route. Suggested by: Phone conversation with Julian (sleep well!)
# ff6d0a59	16-Jul-1997	Julian Elischer <julian@FreeBSD.org>	Bungled cut/paste leaves kernel with page faults.. (read all about it!)
# 7f33a738	15-Jul-1997	Julian Elischer <julian@FreeBSD.org>	Finally track down the reason for some of my occasional kernel crashes. Route(1) has a bug that sends a bad message to the kernel. The kernel trusts it and crashes. Add some sanity checks so that we don't trust the user quite as much any more. (also add a comment in if_ethersubr.c)
# a29f300e	27-Apr-1997	Garrett Wollman <wollman@FreeBSD.org>	The long-awaited mega-massive-network-code- cleanup. Part I. This commit includes the following changes: 1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility glue for them is deleted, and the kernel will panic on boot if any are compiled in. 2) Certain protocol entry points are modified to take a process structure, so that they can easily tell whether or not it is possible to sleep, and also to access credentials. 3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt() call. Protocols should use the process pointer they are now passed. 4) The PF_LOCAL and PF_ROUTE families have been updated to use the new style, as has the `raw' skeleton family. 5) PF_LOCAL sockets now obey the process's umask when creating a socket in the filesystem. As a result, LINT is now broken. I'm hoping that some enterprising hacker with a bit more time will either make the broken bits work (should be easy for netipx) or dike them out.
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 3aa72005	03-Feb-1997	Bill Fenner <fenner@FreeBSD.org>	Make sure we have arguments to pass before calling ifaof_ifpforaddr and ifa_ifwithroute. This eliminates the panic seen in kern/2647, although it doesn't address the fact that RTM_CHANGE can't change flags.
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 477180fb	13-Jan-1997	Garrett Wollman <wollman@FreeBSD.org>	Use the new if_multiaddrs list for multicast addresses rather than the previous hackery involving struct in_ifaddr and arpcom. Get rid of the abominable multi_kludge. Update all network interfaces to use the new mechanism. Distressingly few Ethernet drivers program the multicast filter properly (assuming the hardware has one, which it usually does).
# 59562606	13-Dec-1996	Garrett Wollman <wollman@FreeBSD.org>	Convert the interface address and IP interface address structures to TAILQs. Fix places which referenced these for no good reason that I can see (the references remain, but were fixed to compile again; they are still questionable).
# 29412182	11-Dec-1996	Garrett Wollman <wollman@FreeBSD.org>	Use queue macros for the list of interfaces. Next stop: ifaddrs!
# 1db1fffa	09-Jul-1996	Bill Fenner <fenner@FreeBSD.org>	Disallow host routes that point to themselves. These routes serve no purpose, other than to get in the way of the ARP table and cause "can't allocate llinfo" errors. This change may cause gated or routed to start complaining when adding such routes. If so, these programs will need to be fixed to not try to add these routes. Reviewed by: wollman
# 6ddbf1e2	07-May-1996	Gary Palmer <gpalmer@FreeBSD.org>	Clean up various compiler warnings. Most (if not all) were benign Reviewed by: bde
# 2ee45d7d	11-Mar-1996	David Greenman <dg@FreeBSD.org>	Move or add #include <queue.h> in preparation for upcoming struct socket changes.
# 52041295	16-Nov-1995	Poul-Henning Kamp <phk@FreeBSD.org>	All net.* sysctl converted now.
# cc6a66f2	26-Oct-1995	Julian Elischer <julian@FreeBSD.org>	Reviewed by: julian and jhay@mikom.csir.co.za Submitted by: Mike Mitchell, supervisor@alb.asctmd.com This is a bulk import of Mike's IPX/SPX protocol stacks and all the related gunf that goes with it.. it is not guaranteed to work 100% correctly at this time but as we had several people trying to work on it I figured it would be better to get it checked in so they could all get the same thing to work on.. Mike's been using it for a year or so but on 2.0 more changes and stuff will be merged in from other developers now that this is in. Mike Mitchell, Network Engineer AMTECH Systems Corporation, Technology and Manufacturing 8600 Jefferson Street, Albuquerque, New Mexico 87113 (505) 856-8000 supervisor@alb.asctmd.com
# ed07cc7f	13-Oct-1995	Garrett Wollman <wollman@FreeBSD.org>	Protect against routing socket messages with way-too-big address families. Submitted by: Keith Sklower by way of Paul Traina
# 4ab32c34	08-Oct-1995	Bruce Evans <bde@FreeBSD.org>	Fix types of sysctl functions. Add prototypes. Cosmetic.
# 9e52b982	03-Oct-1995	Garrett Wollman <wollman@FreeBSD.org>	Import of 4.4-Lite-2 sys/net to make merge and examination easier. Since we are not on the vendor branch for any of these files, the conflicts shown make no matter. Obtained from: 4.4BSD-Lite-2
# 9b2e5354	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
# c9a84156	15-May-1995	David Greenman <dg@FreeBSD.org>	Fixed route reference count bug that squirmed in during the routing-socket code upgrade from Berkeley. Submitted by: Garrett Wollman via Peter Wemm via Cornell
# 748e0b0a	10-May-1995	Garrett Wollman <wollman@FreeBSD.org>	Make networking domains drop-ins, through the magic of GNU ld. (Some day, there may even be LKMs.) Also, change the internal name of `unixdomain' to `localdomain' since AF_LOCAL is now the preferred name of this family. Declare netisr correctly and in the right place.
# 78a82810	10-May-1995	Garrett Wollman <wollman@FreeBSD.org>	Updated routing-socket code from Berkeley Obtained from: Keith Bostic by way of Paul Traina
# b5e8ce9f	16-Mar-1995	Bruce Evans <bde@FreeBSD.org>	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# 995add1a	13-Dec-1994	Garrett Wollman <wollman@FreeBSD.org>	Add support for two separate cloning flags, one set by the lower layers, and one set by the protocol family. Also add another parameter to rtalloc1() to allow for any interface flags to be ignored; currently this is only useful for RTF_PRCLONING. Get rid of rt_prflags and re-unite with rt_flags. Add T/TCP ``route metrics''. NB: YOU MUST RECOMPILE `route' AND OTHER RELATED PROGRAMS AS A RESULT OF THIS CHANGE. This also adds a new interface parameter, `ifi_physical', which will eventually replace IFF_ALTPHYS as the mechanism for specifying the particular physical connection desired on a multiple-connection card. NB: YOU MUST RECOMPILE `ifconfig' AND OTHER RELATED PROGRAMS AS A RESULT OF THIS CHANGE.
# 5df72964	11-Oct-1994	Garrett Wollman <wollman@FreeBSD.org>	Fix a bug which caused panics when attempting to change just the flags of a route. (This still doesn't work, but it doesn't panic now.) It looks like there may be a number of incipient bugs in this code. Also, get ready for the time when all IP gateway routes are cloning, which is necessary to keep proper TCP statistics.
# df440948	08-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	Cosmetics: to silence gcc -Wall.
# 2624cf89	05-Oct-1994	Garrett Wollman <wollman@FreeBSD.org>	A number of bug-fixes inspired by Mark Treacy: - Allow PPP to run multicasts natively. - Deal properly with lots of similarly-named interfaces. - Don't sign-extend if_flags. NB: the last fix (to rtsock.c) must be reversed when we expand if_flags to a reasonable size. Submitted by: Mark Treacy
# c5789ba3	04-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	Moved m_copyback into uipc_mbuf.c
# 3c4dd356	02-Aug-1994	David Greenman <dg@FreeBSD.org>	Added $Id$
# 26f9a767	25-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources