Cross Reference: /freebsd-current/sys/netinet/ip

History log of /freebsd-current/sys/netinet/ip_divert.c
Revision	Date	Author	Comments
# d4033ebd	12-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	divert: just return EOPNOTSUPP on shutdown(2) Before this change we would always return ENOTCONN. There is no legitimate use of shutdown(2) on divert(4).
# c1146e6a	20-Oct-2023	Kristof Provost <kp@FreeBSD.org>	pf: use an enum for packet direction in divert tag The benefit is that in the debugger you will see PF_DIVERT_MTAG_DIR_IN instead of 1 when looking at a structure. And compilation time failure if anybody sets it to a wrong value. Using "port" instead of "ndir" when assigning a port improves readability of code. Suggested by: glebius MFC after: 3 weeks X-MFC-With: fabf705f4b
# fabf705f	18-Oct-2023	Igor Ostapenko <pm@igoro.pro>	pf: fix pf divert-to loop Resolved conflict between ipfw and pf if both are used and pf wants to do divert(4) by having separate mtags for pf and ipfw. Also fix the incorrect 'rulenum' check, which caused the reported loop. While here add a few test cases to ensure that divert-to works as expected, even if ipfw is loaded. divert(4) PR: 272770 MFC after: 3 weeks Reviewed by: kp Differential Revision: https://reviews.freebsd.org/D42142
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s]__FBSDID$"\$FreeBSD\$"$;?\s*\n/
# 3d0d5b21	23-Jan-2023	Justin Hibbits <jhibbits@FreeBSD.org>	IfAPI: Explicitly include <net/if_private.h> in netstack Summary: In preparation of making if_t completely opaque outside of the netstack, explicitly include the header. <net/if_var.h> will stop including the header in the future. Sponsored by: Juniper Networks, Inc. Reviewed by: glebius, melifaro Differential Revision: https://reviews.freebsd.org/D38200
# aa74cc6d	06-Sep-2022	Gleb Smirnoff <glebius@FreeBSD.org>	divert(4): do not depend on ipfw(4) Although originally socket was intended to use with ipfw(4) only, now it also can be used with pf(4). On a kernel without packet filters, it still can be used to inject traffic.
# 999c9fd7	06-Sep-2022	Gleb Smirnoff <glebius@FreeBSD.org>	divert(4): don't check for CSUM_SCTP without INET This compiles, but actually is a dead code. Noticed by: bz Fixes: e72c522858cb
# e72c5228	30-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	divert(4): make it compilable and working without INET Differential revision: https://reviews.freebsd.org/D36383
# f1fb0517	30-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	divert(4): maintain own cb database and stop using inpcb KPI Here go cons of using inpcb for divert: - divert(4) uses only 16 bits (local port) out of struct inpcb, which is 424 bytes today. - The inpcb KPI isn't able to provide hashing for divert(4), thus it uses global inpcb list for lookups. - divert(4) uses INET-specific part of the KPI, making INET a requirement for IPDIVERT. Maintain our own very simple hash lookup database instead. It has mutex protection for write and epoch protection for lookups. Since now so->so_pcb no longer points to struct inpcb, don't initialize protosw methods to methods that belong to PF_INET. Also, drop support for setting options on a divert socket. My review of software in base and ports confirms that this has no use and unlikely worked before. Differential revision: https://reviews.freebsd.org/D36382
# 2b1c7217	30-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	divert(4): provide statistics Instead of incrementing pretty random counters in the IP statistics, create divert socket statistics structure. Export via netstat(1). Differential revision: https://reviews.freebsd.org/D36381
# 8624f434	30-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	divert: declare PF_DIVERT domain and stop abusing PF_INET The divert(4) is not a protocol of IPv4. It is a socket to intercept packets from ipfw(4) to userland and re-inject them back. It can divert and re-inject IPv4 and IPv6 packets today, but potentially it is not limited to these two protocols. The IPPROTO_DIVERT does not belong to known IP protocols, it doesn't even fit into u_char. I guess, the implementation of divert(4) was done the way it is done basically because it was easier to do it this way, back when protocols for sockets were intertwined with IP protocols and domains were statically compiled in. Moving divert(4) out of inetsw accomplished two important things: 1) IPDIVERT is getting much closer to be not dependent on INET. This will be finalized in following changes. 2) Now divert socket no longer aliases with raw IPv4 socket. Domain/proto selection code won't need a hack for SOCK_RAW and multiple entries in inetsw implementing different flavors of raw socket can merge into one without requirement of raw IPv4 being the last member of dom_protosw. Differential revision: https://reviews.freebsd.org/D36379
# 8fc80638	29-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	divert: merge div_output() into div_send() No functional change intended.
# e7d02be1	17-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protosw: refactor protosw and domain static declaration and load o Assert that every protosw has pr_attach. Now this structure is only for socket protocols declarations and nothing else. o Merge struct pr_usrreqs into struct protosw. This was suggested in 1996 by wollman@ (see 7b187005d18ef), and later reiterated in 2006 by rwatson@ (see 6fbb9cf860dcd). o Make struct domain hold a variable sized array of protosw pointers. For most protocols these pointers are initialized statically. Those domains that may have loadable protocols have spacers. IPv4 and IPv6 have 8 spacers each (andre@ dff3237ee54ea). o For inetsw and inet6sw leave a comment noting that many protosw entries very likely are dead code. o Refactor pf_proto_[un]register() into protosw_[un]register(). o Isolate pr_*_notsupp() methods into uipc_domain.c Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36232
# 78b1fc05	17-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protosw: separate pr_input and pr_ctlinput out of protosw The protosw KPI historically has implemented two quite orthogonal things: protocols that implement a certain kind of socket, and protocols that are IPv4/IPv6 protocol. These two things do not make one-to-one correspondence. The pr_input and pr_ctlinput methods were utilized only in IP protocols. This strange duality required IP protocols that doesn't have a socket to declare protosw, e.g. carp(4). On the other hand developers of socket protocols thought that they need to define pr_input/pr_ctlinput always, which lead to strange dead code, e.g. div_input() or sdp_ctlinput(). With this change pr_input and pr_ctlinput as part of protosw disappear and IPv4/IPv6 get their private single level protocol switch table ip_protox[] and ip6_protox[] respectively, pointing at array of ipproto_input_t functions. The pr_ctlinput that was used for control input coming from the network (ICMP, ICMPv6) is now represented by ip_ctlprotox[] and ip6_ctlprotox[]. ipproto_register() becomes the only official way to register in the table. Those protocols that were always static and unlikely anybody is interested in making them loadable, are now registered by ip_init(), ip6_init(). An IP protocol that considers itself unloadable shall register itself within its own private SYSINIT(). Reviewed by: tuexen, melifaro Differential revision: https://reviews.freebsd.org/D36157
# c7a62c92	10-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	inpcb: gather v4/v6 handling code into in_pcballoc() from protocols Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D36062
# 39f7de58	13-Apr-2022	John Baldwin <jhb@FreeBSD.org>	divert_packet: ip is only used for SCTP.
# fec8a8c7	03-Jan-2022	Gleb Smirnoff <glebius@FreeBSD.org>	inpcb: use global UMA zones for protocols Provide structure inpcbstorage, that holds zones and lock names for a protocol. Initialize it with global protocol init using macro INPCBSTORAGE_DEFINE(). Then, at VNET protocol init supply it as the main argument to the in_pcbinfo_init(). Each VNET pcbinfo uses its private hash, but they all use same zone to allocate and SMR section to synchronize. Note: there is kern.ipc.maxsockets sysctl, which controls UMA limit on the socket zone, which was always global. Historically same maxsockets value is applied also to every PCB zone. Important fact: you can't create a pcb without a socket! A pcb may outlive its socket, however. Given that there are multiple protocols, and only one socket zone, the per pcb zone limits seem to have little value. Under very special conditions it may trigger a little bit earlier than socket zone limit, but in most setups the socket zone limit will be triggered earlier. When VIMAGE was added to the kernel PCB zones became per-VNET. This magnified existing disbalance further: now we have multiple pcb zones in multiple vnets limited to maxsockets, but every pcb requires a socket allocated from the global zone also limited by maxsockets. IMHO, this per pcb zone limit doesn't bring any value, so this patch drops it. If anybody explains value of this limit, it can be restored very easy - just 2 lines change to in_pcbstorage_init(). Differential revision: https://reviews.freebsd.org/D33542
# 89128ff3	03-Jan-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protocols: init with standard SYSINIT(9) or VNET_SYSINIT The historical BSD network stack loop that rolls over domains and over protocols has no advantages over more modern SYSINIT(9). While doing the sweep, split global and per-VNET initializers. Getting rid of pr_init allows to achieve several things: o Get rid of ifdef's that protect against double foo_init() when both INET and INET6 are compiled in. o Isolate initializers statically to the module they init. o Makes code easier to understand and maintain. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D33537
# db0ac6de	02-Dec-2021	Cy Schubert <cy@FreeBSD.org>	Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816" This reverts commit 266f97b5e9a7958e365e78288616a459b40d924a, reversing changes made to a10253cffea84c0c980a36ba6776b00ed96c3e3b. A mismerge of a merge to catch up to main resulted in files being committed which should not have been.
# de2d4784	02-Dec-2021	Gleb Smirnoff <glebius@FreeBSD.org>	SMR protection for inpcbs With introduction of epoch(9) synchronization to network stack the inpcb database became protected by the network epoch together with static network data (interfaces, addresses, etc). However, inpcb aren't static in nature, they are created and destroyed all the time, which creates some traffic on the epoch(9) garbage collector. Fairly new feature of uma(9) - Safe Memory Reclamation allows to safely free memory in page-sized batches, with virtually zero overhead compared to uma_zfree(). However, unlike epoch(9), it puts stricter requirement on the access to the protected memory, needing the critical(9) section to access it. Details: - The database is already build on CK lists, thanks to epoch(9). - For write access nothing is changed. - For a lookup in the database SMR section is now required. Once the desired inpcb is found we need to transition from SMR section to r/w lock on the inpcb itself, with a check that inpcb isn't yet freed. This requires some compexity, since SMR section itself is a critical(9) section. The complexity is hidden from KPI users in inp_smr_lock(). - For a inpcb list traversal (a pcblist sysctl, or broadcast notification) also a new KPI is provided, that hides internals of the database - inp_next(struct inp_iterator *). Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33022
# 44775b16	24-Nov-2021	Mark Johnston <markj@FreeBSD.org>	netinet: Remove unneeded mb_unmapped_to_ext() calls in_cksum_skip() now handles unmapped mbufs on platforms where they're permitted. Reviewed by: glebius, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33097
# 756bb50b	16-Nov-2021	Mark Johnston <markj@FreeBSD.org>	sctp: Remove now-unneeded mb_unmapped_to_ext() calls sctp_delayed_checksum() now handles unmapped mbufs, thanks to m_apply(). No functional change intended. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32942
# 4a9e9528	02-Nov-2021	Andrey V. Elsukov <ae@FreeBSD.org>	ip_divert: calculate delayed checksum for IPv6 adress family Before passing an IPv6 packet to application apply delayed checksum calculation. Mbuf flags will be lost when divert listener will return a packet back, so we will not be able to do delayed checksum calculation later. Also an application will get a packet with correct checksum. Reviewed by: donner MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D32807
# f2d266f3	22-Oct-2021	Gleb Smirnoff <glebius@FreeBSD.org>	Don't run ip_ctloutput() for divert socket. It was here since divert(4) was introduced, probably just came with a protocol definition boilerplate. There is no useful socket option that can be set or get for a divert socket. Reviewed by: donner Differential Revision: https://reviews.freebsd.org/D32608
# d89c820b	22-Oct-2021	Gleb Smirnoff <glebius@FreeBSD.org>	Remove div_ctlinput(). This function does nothing since 97d8d152c28b. It was introduced in 252f24a2cf40 with a sidenote "may not be needed". Reviewed by: donner Differential Revision: https://reviews.freebsd.org/D32608
# 7045b160	28-Jul-2021	Roy Marples <roy@marples.name>	socket: Implement SO_RERROR SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we loose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports. Reviewed by: philip (network), kbowling (transport), gbe (manpages) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26652
# d8acd268	12-May-2021	Mark Johnston <markj@FreeBSD.org>	Fix mbuf leaks in various pru_send implementations The various protocol implementations are not very consistent about freeing mbufs in error paths. In general, all protocols must free both "m" and "control" upon an error, except if PRUS_NOTREADY is specified (this is only implemented by TCP and unix(4) and requires further work not handled in this diff), in which case "control" still must be freed. This diff plugs various leaks in the pru_send implementations. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30151
# a1fadf7d	07-May-2021	Mark Johnston <markj@FreeBSD.org>	divert: Fix mbuf ownership confusion in div_output() div_output_outbound() and div_output_inbound() relied on the caller to free the mbuf if an error occurred. However, this is contrary to the semantics of their callees, ip_output(), ip6_output() and netisr_queue_src(), which always consume the mbuf. So, if one of these functions returned an error, that would get propagated up to div_output(), resulting in a double free. Fix the problem by making div_output_outbound() and div_output_inbound() responsible for freeing the mbuf in all cases. Reported by: Michael Schmiedgen <schmiedgen@gmx.net> Tested by: Michael Schmiedgen Reviewed by: donner MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30129
# f161d294	02-May-2021	Mark Johnston <markj@FreeBSD.org>	Add missing sockaddr length and family validation to various protocols Several protocol methods take a sockaddr as input. In some cases the sockaddr lengths were not being validated, or were validated after some out-of-bounds accesses could occur. Add requisite checking to various protocol entry points, and convert some existing checks to assertions where appropriate. Reported by: syzkaller+KASAN Reviewed by: tuexen, melifaro MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29519
# 65290859	21-Apr-2021	Mark Johnston <markj@FreeBSD.org>	Add required checks for unmapped mbufs in ipdivert and ipfw Also add an M_ASSERTMAPPED() macro to verify that all mbufs in the chain are mapped. Use it in ipfw_nat, which operates on a chain returned by m_megapullup(). PR: 255164 Reviewed by: ae, gallatin MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29838
# c80a4b76	29-Mar-2021	Andrey V. Elsukov <ae@FreeBSD.org>	ipdivert: check that PCB is still valid after taking INPCB_RLOCK. We are inspecting PCBs of divert sockets under NET_EPOCH section, but PCB could be already detached and we should check INP_FREED flag when we took INP_RLOCK. PR: 254478 MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29420
# 924d1c9a	08-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors." Wrong version of the change was pushed inadvertenly. This reverts commit 4a01b854ca5c2e5124958363b3326708b913af71.
# 4a01b854	07-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we loose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports.
# 95033af9	18-Jun-2020	Mark Johnston <markj@FreeBSD.org>	Add the SCTP_SUPPORT kernel option. This is in preparation for enabling a loadable SCTP stack. Analogous to IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured in order to support a loadable SCTP implementation. Discussed with: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
# 7029da5c	26-Feb-2020	Pawel Biernacki <kaktus@FreeBSD.org>	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
# 481be5de	12-Feb-2020	Randall Stewart <rrs@FreeBSD.org>	White space cleanup -- remove trailing tab's or spaces from any line. Sponsored by: Netflix Inc.
# 75831a1c	26-Jan-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix NOINET6 build after r357038. Reported by: AN <andy at neu.net>
# ab15488f	23-Jan-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Bring indentation back to normal after r357038. No functional changes. MFC after: 3 weeks
# 5533ec48	23-Jan-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix epoch-related panic in ipdivert, ensuring in_broadcast() is called within epoch. Simplify gigantic div_output() by splitting it into 3 functions, handling preliminary setup, remote "ip[6]_output" case and local "netisr" case. Leave original indenting in most parts to ease diff comparison. Indentation will be fixed by a followup commit. Reported by: Nick Hibma <nick at van-laarhoven.org> Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D23317
# b9555453	21-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Make ip6_output() and ip_output() require network epoch. All callers that before may called into these functions without network epoch now must enter it.
# 032677ce	07-Nov-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Now that there is no R/W lock on PCB list the pcblist sysctls handlers can be greatly simplified. All the previous double cycling and complex locking was added to avoid these functions holding global PCB locks for extended period of time, preventing addition of new entries.
# de537d63	07-Nov-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in divert_packet(). This function is called only from pfil(9) filters, which in their place always run in the network epoch.
# 1e5db73d	10-Oct-2019	Gleb Smirnoff <glebius@FreeBSD.org>	The divert(4) module must always be running in network epoch, thus call to if_addr_rlock() isn't needed.
# 1830dae3	14-Mar-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Make second argument of ip_divert(), that specifies packet direction a bool. This allows pf(4) to avoid including ipfw(4) private files.
# a68cc388	08-Jan-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Mechanical cleanup of epoch(9) usage in network stack. - Remove macros that covertly create epoch_tracker on thread stack. Such macros a quite unsafe, e.g. will produce a buggy code if same macro is used in embedded scopes. Explicitly declare epoch_tracker always. - Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read locking macros to what they actually are - the net_epoch. Keeping them as is is very misleading. They all are named FOO_RLOCK(), while they no longer have lock semantics. Now they allow recursion and what's more important they now no longer guarantee protection against their companion WLOCK macros. Note: INP_HASH_RLOCK() has same problems, but not touched by this commit. This is non functional mechanical change. The only functionally changed functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter epoch recursively. Discussed with: jtl, gallatin
# 79db6fe7	22-Nov-2018	Mark Johnston <markj@FreeBSD.org>	Plug some networking sysctl leaks. Various network protocol sysctl handlers were not zero-filling their output buffers and thus would export uninitialized stack memory to userland. Fix a number of such handlers. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: tuexen MFC after: 3 days Security: kernel memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18301
# 5f901c92	24-Jul-2018	Andrew Turner <andrew@FreeBSD.org>	Use the new VNET_DEFINE_STATIC macro when we are defining static VNET variables. Reviewed by: bz Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16147
# 6573d758	03-Jul-2018	Matt Macy <mmacy@FreeBSD.org>	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066
# 99208b82	01-Jul-2018	Matt Macy <mmacy@FreeBSD.org>	inpcb: don't gratuitously defer frees Don't defer frees in sysctl handlers. It isn't necessary and it just confuses things. revert: r333911, r334104, and r334125 Requested by: jtl
# b872626d	12-Jun-2018	Matt Macy <mmacy@FreeBSD.org>	mechanical CK macro conversion of inpcbinfo lists This is a dependency for converting the inpcbinfo hash and info rlocks to epoch.
# fe524329	23-May-2018	Matt Macy <mmacy@FreeBSD.org>	convert allocations to INVARIANTS M_ZERO
# 4f6c66cc	23-May-2018	Matt Macy <mmacy@FreeBSD.org>	UDP: further performance improvements on tx Cumulative throughput while running 64 netperf -H $DUT -t UDP_STREAM -- -m 1 on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps Single stream throughput increases from 910kpps to 1.18Mpps Baseline: https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg - Protect read access to global ifnet list with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg - Protect short lived ifaddr references with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg - Convert if_afdata read lock path to epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg A fix for the inpcbhash contention is pending sufficient time on a canary at LLNW. Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15409
# 246a6199	23-May-2018	Matt Macy <mmacy@FreeBSD.org>	epoch: allow for conditionally asserting that the epoch context fields are unused by zeroing on INVARIANTS builds
# 056b40e2	19-May-2018	Matt Macy <mmacy@FreeBSD.org>	inpcb: consolidate possible deletion in pcblist functions in to epoch deferred context.
# d7c5a620	18-May-2018	Matt Macy <mmacy@FreeBSD.org>	ifnet: Replace if_addr_lock rwlock with epoch + mutex Run on LLNW canaries and tested by pho@ gallatin: Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5 based ConnectX 4-LX NIC, I see an almost 12% improvement in received packet rate, and a larger improvement in bytes delivered all the way to userspace. When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1, I see, using nstat -I mce0 1 before the patch: InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32 4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32 4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32 4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32 4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32 4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32 4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32 After the patch InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51 5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51 5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51 5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51 5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52 5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52 Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15366
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# 38cc96a8	17-May-2017	Andrey V. Elsukov <ae@FreeBSD.org>	Set M_BCAST and M_MCAST flags on mbuf sent via divert socket. r290383 has changed how mbufs sent by divert socket are handled. Previously they are always handled by slow path processing in ip_input(). Now ip_tryforward() is invoked from ip_input() before in_broadcast() check. Since diverted packet lost all mbuf flags, it passes the broadcast check in ip_tryforward() due to missing M_BCAST flag. In the result the broadcast packet is forwarded to the wire instead of be consumed by network stack. Add in_broadcast() check to the div_output() function. And restore the M_BCAST flag if destination address is broadcast for the given network interface. PR: 209491 MFC after: 1 week
# cc487c16	15-May-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Reduce in_pcbinfo_init() by two params. No users supply any flags to this function (they used to say UMA_ZONE_NOFREE), so flag parameter goes away. The zone_fini parameter also goes away. Previously no protocols (except divert) supplied zone_fini function, so inpcb locks were leaked with slabs. This was okay while zones were allocated with UMA_ZONE_NOFREE flag, but now this is a leak. Fix that by suppling inpcb_fini() function as fini method for all inpcb zones.
# cc65eb4e	21-Mar-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Hide struct inpcb, struct tcpcb from the userland. This is a painful change, but it is needed. On the one hand, we avoid modifying them, and this slows down some ideas, on the other hand we still eventually modify them and tools like netstat(1) never work on next version of FreeBSD. We maintain a ton of spares in them, and we already got some ifdef hell at the end of tcpcb. Details: - Hide struct inpcb, struct tcpcb under _KERNEL \|\| _WANT_FOO. - Make struct xinpcb, struct xtcpcb pure API structures, not including kernel structures inpcb and tcpcb inside. Export into these structures the fields from inpcb and tcpcb that are known to be used, and put there a ton of spare space. - Make kernel and userland utilities compilable after these changes. - Bump __FreeBSD_version. Reviewed by: rrs, gnn Differential Revision: D10018
# fbbd9655	28-Feb-2017	Warner Losh <imp@FreeBSD.org>	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96
# 3f58662d	01-Jun-2016	Bjoern A. Zeeb <bz@FreeBSD.org>	The pr_destroy field does not allow us to run the teardown code in a specific order. VNET_SYSUNINITs however are doing exactly that. Thus remove the VIMAGE conditional field from the domain(9) protosw structure and replace it with VNET_SYSUNINITs. This also allows us to change some order and to make the teardown functions file local static. Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use internally. Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g., for pfil consumers (firewalls), partially for this commit and for others to come. Reviewed by: gnn, tuexen (sctp), jhb (kernel.h) Obtained from: projects/vnet MFC after: 2 weeks X-MFC: do not remove pr_destroy Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D6652
# a5b50fbc	27-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	ipdivert: Remove unnecessary and incorrectly typed variable. In principle n is only used to carry a copy of ipi_count, which is unsigned, in the non-VIMAGE case, however ipi_count can be used directly so it is not needed at all. Removing it makes things look cleaner.
# 99d628d5	15-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	netinet: for pointers replace 0 with NULL. These are mostly cosmetical, no functional change. Found with devel/coccinelle. Reviewed by: ae. tuexen
# 4c86b2bc	09-Apr-2016	Bjoern A. Zeeb <bz@FreeBSD.org>	Mfp: r296346 No reason identified to keep UMA_ZONE_NOFREE here. Reviewed by: gnn MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5736
# eef5775f	22-Jan-2016	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix build and avoid a double-free in the VIMAGE case. Sponsored by: The FreeBSD Foundation
# bb84e3d7	22-Jan-2016	Bjoern A. Zeeb <bz@FreeBSD.org>	Correct function arguments for SYSUNINITs. Sponsored by: The FreeBSD Foundation
# 1f12da0e	22-Jan-2016	Bjoern A. Zeeb <bz@FreeBSD.org>	Just checkpoint the WIP in order to be able to make the tree update easier. Note: this is currently not in a usable state as certain teardown parts are not called and the DOMAIN rework is missing. More to come soon and find its way to head. Obtained from: P4 //depot/user/bz/vimage/... Sponsored by: The FreeBSD Foundation
# 257480b8	04-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Convert netinet6/ to use new routing API. * Remove &ifpp from ip6_output() in favor of ri->ri_nh_info * Provide different wrappers to in6_selectsrc: Currently it is used by 2 differenct type of customers: - socket-based one, which all are unsure about provided address scope and - in-kernel ones (ND code mostly), which don't have any sockets, options, crededentials, etc. So, we provide two different wrappers to in6_selectsrc() returning select source. * Make different versions of selectroute(): Currenly selectroute() is used in two scenarios: - SAS, via in6_selecsrc() -> in6_selectif() -> selectroute() - output, via in6_output -> wrapper -> selectroute() Provide different versions for each customer: - fib6_lookup_nh_basic()-based in6_selectif() which is capable of returning interface only, without MTU/NHOP/L2 calculations - full-blown fib6_selectroute() with cached route/multipath/ MTU/L2 * Stop using routing table for link-local address lookups * Add in6_ifawithifp_lla() to make for-us check faster for link-local * Add in6_splitscope / in6_setllascope for faster embed/deembed scopes
# 585a4290	11-Oct-2014	John Baldwin <jhb@FreeBSD.org>	Update ip_divert.ko to depend on version 3 of ipfw.
# 8f5a8818	07-Aug-2014	Kevin Lo <kevlo@FreeBSD.org>	Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have only one protocol switch structure that is shared between ipv4 and ipv6. Phabric: D476 Reviewed by: jhb
# c3322cb9	28-Oct-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Include necessary headers that now are available due to pollution via if_var.h. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 76039bc8	26-Oct-2013	Gleb Smirnoff <glebius@FreeBSD.org>	The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare to this event, adding if_var.h to files that do need it. Also, include all includes that now are included due to implicit pollution via if_var.h Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 8f134647	22-Oct-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Switch the entire IPv4 stack to keep the IP packet header in network byte order. Any host byte order processing is done in local variables and host byte order values are never[1] written to a packet. After this change a packet processed by the stack isn't modified at all[2] except for TTL. After this change a network stack hacker doesn't need to scratch his head trying to figure out what is the byte order at the given place in the stack. [1] One exception still remains. The raw sockets convert host byte order before pass a packet to an application. Probably this would remain for ages for compatibility. [2] The ip_input() still subtructs header len from ip->ip_len, but this is planned to be fixed soon. Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru> Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
# e76163a5	15-Oct-2012	Gleb Smirnoff <glebius@FreeBSD.org>	We don't need to convert ip6_len to host byte order before ip6_output(), the IPv6 stack is working in net byte order. The reason this code worked before is that ip6_output() doesn't look at ip6_plen at all and recalculates it based on mbuf length.
# 9823d527	10-Oct-2012	Kevin Lo <kevlo@FreeBSD.org>	Revert previous commit... Pointyhat to: kevlo (myself)
# a10cee30	09-Oct-2012	Kevin Lo <kevlo@FreeBSD.org>	Prefer NULL over 0 for pointers
# 23e9c6dc	08-Oct-2012	Gleb Smirnoff <glebius@FreeBSD.org>	After r241245 it appeared that in_delayed_cksum(), which still expects host byte order, was sometimes called with net byte order. Since we are moving towards net byte order throughout the stack, the function was converted to expect net byte order, and its consumers fixed appropriately: - ip_output(), ipfilter(4) not changed, since already call in_delayed_cksum() with header in net byte order. - divert(4), ng_nat(4), ipfw_nat(4) now don't need to swap byte order there and back. - mrouting code and IPv6 ipsec now need to switch byte order there and back, but I hope, this is temporary solution. - In ipsec(4) shifted switch to net byte order prior to in_delayed_cksum(). - pf_route() catches up on r241245 changes to ip_output().
# b7fb54d8	08-Oct-2012	Gleb Smirnoff <glebius@FreeBSD.org>	No reason to play with IP header before calling sctp_delayed_cksum() with offset beyond the IP header.
# 08adfbbf	22-Jan-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	Make #error messages string-literals and remove punctuation. Reported by: bde (for ip_divert) Reviewed by: bde MFC after: 3 days
# aa57e971	21-Jan-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix ip_divert handling of inet and inet6 and module building some more. Properly sort the "carp" case in modules/Makefile after it was renamed. Reported by: bde (most) Reviewed by: bde MFC after: 3 days
# 6472ac3d	07-Nov-2011	Ed Schouten <ed@FreeBSD.org>	Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
# 217e3abc	01-Aug-2011	Gleb Smirnoff <glebius@FreeBSD.org>	Add missing break; in r223593. Submitted by: sem Pointy hat to: glebius Approved by: re (kib)
# 812f1d32	26-Jun-2011	Gleb Smirnoff <glebius@FreeBSD.org>	Add possibility to pass IPv6 packets to a divert(4) socket. Submitted by: sem
# 52cd27cb	05-Jun-2011	Robert Watson <rwatson@FreeBSD.org>	Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
# 711b3dbd	04-Jun-2011	Robert Watson <rwatson@FreeBSD.org>	IP divert sockets use their inpcbinfo for port reservation, although not for lookup. I missed its call to in_pcbbind() when preparing previous patches, which would lead to a lock assertion failure (although problem not an actual race condition due to global pcbinfo locks providing required synchronisation -- in this particular case only). This change adds the missing locking of the pcbhash lock. (Existing comments in the ipdivert code question the need for using the global hash to manage the namespace, as really it's a simple port namespace and not an address/port namespace. Also, although in_pcbbind is used to manage reservations, the hash tables aren't used for lookup. It might be a good idea to make them use hashed lookup, or to use a different reservation scheme.) Reviewed by: bz Reported by: Kristof Provost <kristof at sigsegv.be> Sponsored by: Juniper Networks
# fa046d87	30-May-2011	Robert Watson <rwatson@FreeBSD.org>	Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
# 79c3d51b	18-Jan-2011	Matthew D Fleming <mdf@FreeBSD.org>	Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need to rely on the format string. For SYSCTL_PROC instances that I noticed a discrepancy between the CTLTYPE and the format specifier, fix the CTLTYPE.
# 3e288e62	22-Nov-2010	Dimitry Andric <dim@FreeBSD.org>	After some off-list discussion, revert a number of changes to the DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various people working on the affected files. A better long-term solution is still being considered. This reversal may give some modules empty set_pcpu or set_vnet sections, but these are harmless. Changes reverted: ------------------------------------------------------------------------ r215318 \| dim \| 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) \| 4 lines Instead of unconditionally emitting .globl's for the __start_set_xxx and __stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu sections are actually defined. ------------------------------------------------------------------------ r215317 \| dim \| 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) \| 3 lines Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree. ------------------------------------------------------------------------ r215316 \| dim \| 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) \| 2 lines Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
# 31c6a003	14-Nov-2010	Dimitry Andric <dim@FreeBSD.org>	Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree.
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# c007b96a	17-Aug-2010	John Baldwin <jhb@FreeBSD.org>	Ensure a minimum "slop" of 10 extra pcb structures when providing a memory size estimate to userland for pcb list sysctls. The previous behavior of a "slop" of n/8 does not work well for small values of n (e.g. no slop at all if you have less than 8 open UDP connections). Reviewed by: bz MFC after: 1 week
# c00cb785	03-Jun-2010	Robert Watson <rwatson@FreeBSD.org>	Merge r204810 from head to stable/8: Remove unnecessary locking of divcbinfo lock from div_output(): this has not been required since FreeBSD 7.0 when the so_pcb pointer leading to inp was guaranteed to be stable when a valid socket reference is held (as it is in the output path). Reviewed by: bz Sponsored by: Juniper Networks Approved by: re (kib)
# 54bb4167	05-Apr-2010	Randall Stewart <rrs@FreeBSD.org>	MFC of 2 items to fix the csum for v6 issue: Revision 205075 and 205104: ---------205075---------- With the recent change of the sctp checksum to support offload, no delayed checksum was added to the ip6 output code. This causes cards that do not support SCTP checksum offload to have SCTP packets that are IPv6 NOT have the sctp checksum performed. Thus you could not communicate with a peer. This adds the missing bits to make the checksum happen for these cards. ------------------------- ---------205104---------- The proper fix for the delayed SCTP checksum is to have the delayed function take an argument as to the offset to the SCTP header. This allows it to work for V4 and V6. This of course means changing all callers of the function to either pass the header len, if they have it, or create it (ip_hl << 2 or sizeof(ip6_hdr)). ------------------------- PR: 144529
# 397069f2	27-Mar-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	MFC r205251: Add pcb reference counting to the pcblist sysctl handler functions to ensure type stability while caching the pcb pointers for the copyout. Reviewed by: rwatson
# 8018e843	23-Mar-2010	Luigi Rizzo <luigi@FreeBSD.org>	MFC of a large number of ipfw and dummynet fixes and enhancements done in CURRENT over the last 4 months. HEAD and RELENG_8 are almost in sync now for ipfw, dummynet the pfil hooks and related components. Among the most noticeable changes: - r200855 more efficient lookup of skipto rules, and remove O(N) blocks from critical sections in the kernel; - r204591 large restructuring of the dummynet module, with support for multiple scheduling algorithms (4 available so far) See the original commit logs for details. Changes in the kernel/userland ABI should be harmless because the kernel is able to understand previous requests from RELENG_8 and RELENG_7. For this reason, this changeset would be applicable to RELENG_7 as well, but i am not sure if it is worthwhile.
# d0e157f6	17-Mar-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	Add pcb reference counting to the pcblist sysctl handler functions to ensure type stability while caching the pcb pointers for the copyout. Reviewed by: rwatson MFC after: 7 days
# 9bcd427b	14-Mar-2010	Robert Watson <rwatson@FreeBSD.org>	Abstract out initialization of most aspects of struct inpcbinfo from their calling contexts in {IP divert, raw IP sockets, TCP, UDP} and create new helper functions: in_pcbinfo_init() and in_pcbinfo_destroy() to do this work in a central spot. As inpcbinfo becomes more complex due to ongoing work to add connection groups, this will reduce code duplication. MFC after: 1 month Reviewed by: bz Sponsored by: Juniper Networks
# 1966e5b5	12-Mar-2010	Randall Stewart <rrs@FreeBSD.org>	The proper fix for the delayed SCTP checksum is to have the delayed function take an argument as to the offset to the SCTP header. This allows it to work for V4 and V6. This of course means changing all callers of the function to either pass the header len, if they have it, or create it (ip_hl << 2 or sizeof(ip6_hdr)). PR: 144529 MFC after: 2 weeks
# 1d7429e0	06-Mar-2010	Robert Watson <rwatson@FreeBSD.org>	Remove unnecessary locking of divcbinfo lock from div_output(): this has not been required since FreeBSD 7.0 when the so_pcb pointer leading to inp was guaranteed to be stable when a valid socket reference is held (as it is in the output path). MFC after: 1 week Reviewed by: bz Sponsored by: Juniper Networks
# b2019e17	07-Jan-2010	Luigi Rizzo <luigi@FreeBSD.org>	Following up on a request from Ermal Luci to make ip_divert work as a client of pf(4), make ip_divert not depend on ipfw. This is achieved by moving to ip_var.h the struct ipfw_rule_ref (which is part of the mtag for all reinjected packets) and other declarations of global variables, and moving to raw_ip.c global variables for filter and divert hooks. Note that names and locations could be made more generic (ipfw_rule_ref is really a generic reference robust to reconfigurations; the packet filter is not necessarily ipfw; filters and their clients are not necessarily limited to ipv4), but _right now_ most of this stuff works on ipfw and ipv4, so i don't feel like doing a gratuitous renaming, at least for the time being.
# 7173b6e5	04-Jan-2010	Luigi Rizzo <luigi@FreeBSD.org>	Various cleanup done in ipfw3-head branch including: - use a uniform mtag format for all packets that exit and re-enter the firewall in the middle of a rulechain. On reentry, all tags containing reinject info are renamed to MTAG_IPFW_RULE so the processing is simpler. - make ipfw and dummynet use ip_len and ip_off in network format everywhere. Conversion is done only once instead of tracking the format in every place. - use a macro FREE_PKT to dispose of mbufs. This eases portability. On passing i also removed a few typos, staticise or localise variables, remove useless declarations and other minor things. Overall the code shrinks a bit and is hopefully more readable. I have tested functionality for all but ng_ipfw and if_bridge/if_ethersubr. For ng_ipfw i am actually waiting for feedback from glebius@ because we might have some small changes to make. For if_bridge and if_ethersubr feedback would be welcome (there are still some redundant parts in these two modules that I would like to remove, but first i need to check functionality).
# 70228fb3	15-Dec-2009	Luigi Rizzo <luigi@FreeBSD.org>	Start splitting ip_fw2.c and ip_fw.h into smaller components. At this time we pull out from ip_fw2.c the logging functions, and support for dynamic rules, and move kernel-only stuff into netinet/ipfw/ip_fw_private.h No ABI change involved in this commit, unless I made some mistake. ip_fw.h has changed, though not in the userland-visible part. Files touched by this commit: conf/files now references the two new source files netinet/ip_fw.h remove kernel-only definitions gone into netinet/ipfw/ip_fw_private.h. netinet/ipfw/ip_fw_private.h new file with kernel-specific ipfw definitions netinet/ipfw/ip_fw_log.c ipfw_log and related functions netinet/ipfw/ip_fw_dynamic.c code related to dynamic rules netinet/ipfw/ip_fw2.c removed the pieces that goes in the new files netinet/ipfw/ip_fw_nat.c minor rearrangement to remove LOOKUP_NAT from the main headers. This require a new function pointer. A bunch of other kernel files that included netinet/ip_fw.h now require netinet/ipfw/ip_fw_private.h as well. Not 100% sure i caught all of them. MFC after: 1 month
# f04e871e	28-Aug-2009	Marko Zec <zec@FreeBSD.org>	MFC r196502: Introduce a div_destroy() function which takes over per-vnet cleanup tasks from the existing modevent / MOD_UNLOAD handler, and register div_destroy() in protosw as per-vnet .pr_destroy() handler for options VIMAGE builds. In nooptions VIMAGE builds, div_destroy() will be invoked from the modevent handler, resulting in effectively identical operation as it was prior this change. div_destroy() also tears down hashtables used by ipdivert, which were previously left behind on ipdivert kldunloads. For options VIMAGE builds only, temporarily disable kldunloading of ipdivert, because without introducing additional locking logic it is impossible to atomically check whether all ipdivert instances in all vnets are idle, and proceed with cleanup without opening a race window for a vnet to open an ipdivert socket while ipdivert tear-down is in progress. While here, staticize div_init(), because it is not used outside of ip_divert.c. In cooperation with: julian Approved by: re (rwatson), julian (mentor) Approved by: re (rwatson)
# 2b73aaca	24-Aug-2009	Marko Zec <zec@FreeBSD.org>	Introduce a div_destroy() function which takes over per-vnet cleanup tasks from the existing modevent / MOD_UNLOAD handler, and register div_destroy() in protosw as per-vnet .pr_destroy() handler for options VIMAGE builds. In nooptions VIMAGE builds, div_destroy() will be invoked from the modevent handler, resulting in effectively identical operation as it was prior this change. div_destroy() also tears down hashtables used by ipdivert, which were previously left behind on ipdivert kldunloads. For options VIMAGE builds only, temporarily disable kldunloading of ipdivert, because without introducing additional locking logic it is impossible to atomically check whether all ipdivert instances in all vnets are idle, and proceed with cleanup without opening a race window for a vnet to open an ipdivert socket while ipdivert tear-down is in progress. While here, staticize div_init(), because it is not used outside of ip_divert.c. In cooperation with: julian Approved by: re (rwatson), julian (mentor) MFC after: 3 days
# 315e3e38	02-Aug-2009	Robert Watson <rwatson@FreeBSD.org>	Many network stack subsystems use a single global data structure to hold all pertinent statatistics for the subsystem. These structures are sometimes "borrowed" by kernel modules that require a place to store statistics for similar events. Add KPI accessor functions for statistics structures referenced by kernel modules so that they no longer encode certain specifics of how the data structures are named and stored. This change is intended to make it easier to move to per-CPU network stats following 8.0-RELEASE. The following modules are affected by this change: if_bridge if_cxgb if_gif ip_mroute ipdivert pf In practice, most of these statistics consumers should, in fact, maintain their own statistics data structures rather than borrowing structures from the base network stack. However, that change is too agressive for this point in the release cycle. Reviewed by: bz Approved by: re (kib)
# 530c0060	01-Aug-2009	Robert Watson <rwatson@FreeBSD.org>	Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)
# 1e77c105	16-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Remove unused VNET_SET() and related macros; only VNET_GET() is ever actually used. Rename VNET_GET() to VNET() to shorten variable references. Discussed with: bz, julian Reviewed by: bz Approved by: re (kensmith, kib)
# eddfbb76	14-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Build on Jeff Roberson's linker-set based dynamic per-CPU allocator (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables. Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing the in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of a the kernel linker. Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided. This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS. Bump __FreeBSD_version and update UPDATING. Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)
# 6c861560	25-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Update various IPFW-related modules to use if_addr_rlock()/ if_addr_runlock() rather than IF_ADDR_LOCK()/IF_ADDR_UNLOCK(). MFC after: 6 weeks
# 8c0fec80	23-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Modify most routines returning 'struct ifaddr *' to return references rather than pointers, requiring callers to properly dispose of those references. The following routines now return references: ifaddr_byindex ifa_ifwithaddr ifa_ifwithbroadaddr ifa_ifwithdstaddr ifa_ifwithnet ifaof_ifpforaddr ifa_ifwithroute ifa_ifwithroute_fib rt_getifa rt_getifa_fib IFP_TO_IA ip_rtaddr in6_ifawithifp in6ifa_ifpforlinklocal in6ifa_ifpwithaddr in6_ifadd carp_iamatch6 ip6_getdstifaddr Remove unused macro which didn't have required referencing: IFP_TO_IA6 This closes many small races in which changes to interface or address lists while an ifaddr was in use could lead to use of freed memory (etc). In a few cases, add missing if_addr_list locking required to safely acquire references. Because of a lack of deep copying support, we accept a race in which an in6_ifaddr pointed to by mbuf tags and extracted with ip6_getdstifaddr() doesn't hold a reference while in transmit. Once we have mbuf tag deep copy support, this can be fixed. Reviewed by: bz Obtained from: Apple, Inc. (portions) MFC after: 6 weeks (portions)
# bcf11e8d	05-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
# f93bfb23	02-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Add internal 'mac_policy_count' counter to the MAC Framework, which is a count of the number of registered policies. Rather than unconditionally locking sockets before passing them into MAC, lock them in the MAC entry points only if mac_policy_count is non-zero. This avoids locking overhead for a number of socket system calls when no policies are registered, eliminating measurable overhead for the MAC Framework for the socket subsystem when there are no active policies. Possibly socket locks should be acquired by policies if they are required for socket labels, which would further avoid locking overhead when there are policies but they don't require labeling of sockets, or possibly don't even implement socket controls. Obtained from: TrustedBSD Project
# d4b5cae4	01-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Reimplement the netisr framework in order to support parallel netisr threads: - Support up to one netisr thread per CPU, each processings its own workstream, or set of per-protocol queues. Threads may be bound to specific CPUs, or allowed to migrate, based on a global policy. In the future it would be desirable to support topology-centric policies, such as "one netisr per package". - Allow each protocol to advertise an ordering policy, which can currently be one of: NETISR_POLICY_SOURCE: packets must maintain ordering with respect to an implicit or explicit source (such as an interface or socket). NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work, as well as allowing protocols to provide a flow generation function for mbufs without flow identifers (m2flow). Falls back on NETISR_POLICY_SOURCE if now flow ID is available. NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for each packet handled by netisr (m2cpuid). - Provide utility functions for querying the number of workstreams being used, as well as a mapping function from workstream to CPU ID, which protocols may use in work placement decisions. - Add explicit interfaces to get and set per-protocol queue limits, and get and clear drop counters, which query data or apply changes across all workstreams. - Add a more extensible netisr registration interface, in which protocols declare 'struct netisr_handler' structures for each registered NETISR_ type. These include name, handler function, optional mbuf to flow ID function, optional mbuf to CPU ID function, queue limit, and ordering policy. Padding is present to allow these to be expanded in the future. If no queue limit is declared, then a default is used. - Queue limits are now per-workstream, and raised from the previous IFQ_MAXLEN default of 50 to 256. - All protocols are updated to use the new registration interface, and with the exception of netnatm, default queue limits. Most protocols register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use NETISR_POLICY_FLOW, and will therefore take advantage of driver- generated flow IDs if present. - Formalize a non-packet based interface between interface polling and the netisr, rather than having polling pretend to be two protocols. Provide two explicit hooks in the netisr worker for start and end events for runs: netisr_poll() and netisr_pollmore(), as well as a function, netisr_sched_poll(), to allow the polling code to schedule netisr execution. DEVICE_POLLING still embeds single-netisr assumptions in its implementation, so for now if it is compiled into the kernel, a single and un-bound netisr thread is enforced regardless of tunable configuration. In the default configuration, the new netisr implementation maintains the same basic assumptions as the previous implementation: a single, un-bound worker thread processes all deferred work, and direct dispatch is enabled by default wherever possible. Performance measurement shows a marginal performance improvement over the old implementation due to the use of batched dequeue. An rmlock is used to synchronize use and registration/unregistration using the framework; currently, synchronized use is disabled (replicating current netisr policy) due to a measurable 3%-6% hit in ping-pong micro-benchmarking. It will be enabled once further rmlock optimization has taken place. However, in practice, netisrs are rarely registered or unregistered at runtime. A new man page for netisr will follow, but since one doesn't currently exist, it hasn't been updated. This change is not appropriate for MFC, although the polling shutdown handler should be merged to 7-STABLE. Bump __FreeBSD_version. Reviewed by: bz
# f6dfe47a	30-Apr-2009	Marko Zec <zec@FreeBSD.org>	Permit buiding kernels with options VIMAGE, restricted to only a single active network stack instance. Turning on options VIMAGE at compile time yields the following changes relative to default kernel build: 1) V_ accessor macros for virtualized variables resolve to structure fields via base pointers, instead of being resolved as fields in global structs or plain global variables. As an example, V_ifnet becomes: options VIMAGE: ((struct vnet_net ) vnet_net)->_ifnet default build: vnet_net_0._ifnet options VIMAGE_GLOBALS: ifnet 2) INIT_VNET_ macros will declare and set up base pointers to be used by V_ accessor macros, instead of resolving to whitespace: INIT_VNET_NET(ifp->if_vnet); becomes struct vnet_net vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET]; 3) Memory for vnet modules registered via vnet_mod_register() is now allocated at run time in sys/kern/kern_vimage.c, instead of per vnet module structs being declared as globals. If required, vnet modules can now request the framework to provide them with allocated bzeroed memory by filling in the vmi_size field in their vmi_modinfo structures. 4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are extended to hold a pointer to the parent vnet. options VIMAGE builds will fill in those fields as required. 5) curvnet is introduced as a new global variable in options VIMAGE builds, always pointing to the default and only struct vnet. 6) struct sysctl_oid has been extended with additional two fields to store major and minor virtualization module identifiers, oid_v_subs and oid_v_mod. SYSCTL_V_ family of macros will fill in those fields accordingly, and store the offset in the appropriate vnet container struct in oid_arg1. In sysctl handlers dealing with virtualized sysctls, the SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target variable and make it available in arg1 variable for further processing. Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have been deleted. Reviewed by: bz, rwatson Approved by: julian (mentor)
# 093f25f8	26-Apr-2009	Marko Zec <zec@FreeBSD.org>	In preparation for turning on options VIMAGE in next commits, rearrange / replace / adjust several INIT_VNET_* initializer macros, all of which currently resolve to whitespace. Reviewed by: bz (an older version of the patch) Approved by: julian (mentor)
# b132600a	19-Apr-2009	Robert Watson <rwatson@FreeBSD.org>	In divert_packet(), lock the interface address list before iterating over it in search of an address. MFC after: 2 weeks
# 86425c62	11-Apr-2009	Robert Watson <rwatson@FreeBSD.org>	Update stats in struct ipstat using four new macros, IPSTAT_ADD(), IPSTAT_INC(), IPSTAT_SUB(), and IPSTAT_DEC(), rather than directly manipulating the fields across the kernel. This will make it easier to change the implementation of these statistics, such as using per-CPU versions of the data structures. MFC after: 3 days
# 2f4afd21	03-Feb-2009	Randall Stewart <rrs@FreeBSD.org>	Adds support for SCTP checksum offload. This means we, like TCP and UDP, move the checksum calculation into the IP routines when there is no hardware support we call into the normal SCTP checksum routine. The next round of SCTP updates will use this functionality. Of course the IGB driver needs a few updates to support the new intel controller set that actually does SCTP csum offload too. Reviewed by: gnn, rwatson, kmacy
# 385195c0	10-Dec-2008	Marko Zec <zec@FreeBSD.org>	Conditionally compile out V_ globals while instantiating the appropriate container structures, depending on VIMAGE_GLOBALS compile time option. Make VIMAGE_GLOBALS a new compile-time option, which by default will not be defined, resulting in instatiations of global variables selected for V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be effectively compiled out. Instantiate new global container structures to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0, vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0. Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_ macros resolve either to the original globals, or to fields inside container structures, i.e. effectively #ifdef VIMAGE_GLOBALS #define V_rt_tables rt_tables #else #define V_rt_tables vnet_net_0._rt_tables #endif Update SYSCTL_V_*() macros to operate either on globals or on fields inside container structs. Extend the internal kldsym() lookups with the ability to resolve selected fields inside the virtualization container structs. This applies only to the fields which are explicitly registered for kldsym() visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently this is done only in sys/net/if.c. Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code, and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in turn result in proper code being generated depending on VIMAGE_GLOBALS. De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c which were prematurely V_irtualized by automated V_ prepending scripts during earlier merging steps. PF virtualization will be done separately, most probably after next PF import. Convert a few variable initializations at instantiation to initialization in init functions, most notably in ipfw. Also convert TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in initializer functions. Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 4b79449e	02-Dec-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Rather than using hidden includes (with cicular dependencies), directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files. For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h. Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation
# 97021c24	26-Nov-2008	Marko Zec <zec@FreeBSD.org>	Merge more of currently non-functional (i.e. resolving to whitespace) macros from p4/vimage branch. Do a better job at enclosing all instantiations of globals scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks. De-virtualize and mark as const saorder_state_alive and saorder_state_any arrays from ipsec code, given that they are never updated at runtime, so virtualizing them would be pointless. Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# bc97ba51	19-Nov-2008	Julian Elischer <julian@FreeBSD.org>	Fix a scope problem in the multiple routing table code that stopped the SO_SETFIB socket option from working correctly. Obtained from: Ironport MFC after: 3 days
# 44e33a07	19-Nov-2008	Marko Zec <zec@FreeBSD.org>	Change the initialization methodology for global variables scheduled for virtualization. Instead of initializing the affected global variables at instatiation, assign initial values to them in initializer functions. As a rule, initialization at instatiation for such variables should never be introduced again from now on. Furthermore, enclose all instantiations of such global variables in #ifdef VIMAGE_GLOBALS blocks. Essentialy, this change should have zero functional impact. In the next phase of merging network stack virtualization infrastructure from p4/vimage branch, the new initialization methology will allow us to switch between using global variables and their counterparts residing in virtualization containers with minimum code churn, and in the long run allow us to intialize multiple instances of such container structures. Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# f08ef6c5	17-Oct-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Add cr_canseeinpcb() doing checks using the cached socket credentials from inp_cred which is also available after the socket is gone. Switch cr_canseesocket consumers to cr_canseeinpcb. This removes an extra acquisition of the socket lock. Reviewed by: rwatson MFC after: 3 months (set timer; decide then)
# 8b615593	02-Oct-2008	Marko Zec <zec@FreeBSD.org>	Step 1.5 of importing the network stack virtualization infrastructure from the vimage project, as per plan established at devsummit 08/08: http://wiki.freebsd.org/Image/Notes200808DevSummit Introduce INIT_VNET_() initializer macros, VNET_FOREACH() iterator macros, and CURVNET_SET() context setting macros, all currently resolving to NOPs. Prepare for virtualization of selected SYSCTL objects by introducing a family of SYSCTL_V_() macros, currently resolving to their global counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT(). Move selected #defines from sys/sys/vimage.h to newly introduced header files specific to virtualized subsystems (sys/net/vnet.h, sys/netinet/vinet.h etc.). All the changes are verified to have zero functional impact at this point in time by doing MD5 comparision between pre- and post-change object files(). () netipsec/keysock.c did not validate depending on compile time options. Implemented by: julian, bz, brooks, zec Reviewed by: julian, bz, brooks, kris, rwatson, ... Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 603724d3	17-Aug-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Commit step 1 of the vimage project, (network stack) virtualization work done by Marko Zec (zec@). This is the first in a series of commits over the course of the next few weeks. Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only. We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again. Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch
# d185578a	27-Jul-2008	Alexander Motin <mav@FreeBSD.org>	According to in_pcb.h protocol binding information has double locking. It allows access it while list travercing holding only global pcbinfo lock.
# 3656a4fe	20-Apr-2008	Robert Watson <rwatson@FreeBSD.org>	Read lock, rather than write lock, the inpcb when transmitting with or delivering to an IP divert socket. MFC after: 3 months
# 8501a69c	17-Apr-2008	Robert Watson <rwatson@FreeBSD.org>	Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to explicitly select write locking for all use of the inpcb mutex. Update some pcbinfo lock assertions to assert locked rather than write-locked, although in practice almost all uses of the pcbinfo rwlock main exclusive, and all instances of inpcb lock acquisition are exclusive. This change should introduce (ideally) little functional change. However, it lays the groundwork for significantly increased parallelism in the TCP/IP code. MFC after: 3 months Tested by: kris (superset of committered patch)
# 30d239bc	24-Oct-2007	Robert Watson <rwatson@FreeBSD.org>	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updates as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
# 4b421e2d	07-Oct-2007	Mike Silbersack <silby@FreeBSD.org>	Add FBSDID to all files in netinet so that people can more easily include file version information in bug reports. Approved by: re (kensmith)
# b244c8ad	06-Aug-2007	Christian S.J. Peron <csjp@FreeBSD.org>	Over the past couple of years, there have been a number of reports relating the use of divert sockets to dead locks. A number of LORs have been reported between divert and a number of other network subsystems including: IPSEC, Pfil, multicast, ipfw and others. Other dead locks could occur because of recursive entry into the IP stack. This change should take care of most if not all of these issues. A summary of the changes follow: - We disallow multicast operations on divert sockets. It really doesn't make semantic sense to allow this, since typically you would set multicast parameters on multicast end points. NOTE: As a part of this change, we actually dis-allow multicast options on any socket that IS a divert socket OR IS NOT a SOCK_RAW or SOCK_DGRAM family - We check to see if there are any socket options that have been specified on the socket, and if there was (which is very un-common and also probably doesnt make sense to support) we duplicate the mbuf carrying the options. - We then drop the INP/INFO locks over the call to ip_output(). It should be noted that since we no longer support multicast operations on divert sockets and we have duplicated any socket options, we no longer need the reference to the pcb to be coherent. - Finally, we replaced the call to ip_input() to use netisr queuing. This should remove the recursive entry into the IP stack from divert. By dropping the locks over the call to ip_output() we eliminate all the lock ordering issues above. By switching over to netisr on the inbound path, we can no longer recursively enter the ip_input() code via divert. I have tested this change by using the following command: ipfwpcap -r 8000 - \| tcpdump -r - -nn -v This should exercise the input and re-injection (outbound) path, which is very similar to the work load performed by natd(8). Additionally, I have run some ospf daemons which have a heavy reliance on raw sockets and multicast. Approved by: re@ (kensmith) MFC after: 1 month LOR: 163 LOR: 181 LOR: 202 LOR: 203 Discussed with: julian, andre et al (on freebsd-net) In collaboration with: bms [1], rwatson [2] [1] bms helped out with the multicast decisions [2] rwatson submitted the original netisr patches and came up with some of the original ideas on how to combat this issue.
# 54d642bb	11-May-2007	Robert Watson <rwatson@FreeBSD.org>	Reduce network stack oddness: implement .pru_sockaddr and .pru_peeraddr protocol entry points using functions named proto_getsockaddr and proto_getpeeraddr rather than proto_setsockaddr and proto_setpeeraddr. While it's true that sockaddrs are allocated and set, the net effect is to retrieve (get) the socket address or peer address from a socket, not set it, so align names to that intent.
# 169db7b2	11-May-2007	Robert Watson <rwatson@FreeBSD.org>	Remove unneeded wrappers for in_setsockaddr() and in_setpeeraddr(), which used to exist so pcbinfo locks could be acquired, but are no longer required as a result of socket/pcb reference model refinements.
# f2565d68	10-May-2007	Robert Watson <rwatson@FreeBSD.org>	Move universally to ANSI C function declarations, with relatively consistent style(9)-ish layout.
# 84ca8aa6	01-May-2007	Robert Watson <rwatson@FreeBSD.org>	Remove unused pcbinfo arguments to in_setsockaddr() and in_setpeeraddr().
# 712fc218	30-Apr-2007	Robert Watson <rwatson@FreeBSD.org>	Rename some fields of struct inpcbinfo to have the ipi_ prefix, consistent with the naming of other structure field members, and reducing improper grep matches. Clean up and comment structure fields in structure definition.
# 08651e1f	29-Dec-2006	John Baldwin <jhb@FreeBSD.org>	Some whitespace nits and remove a few casts.
# acd3428b	06-Nov-2006	Robert Watson <rwatson@FreeBSD.org>	Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking. Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>
# aed55708	22-Oct-2006	Robert Watson <rwatson@FreeBSD.org>	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
# d915b280	18-Jul-2006	Stephan Uphoff <ups@FreeBSD.org>	Fix race conditions on enumerating pcb lists by moving the initialization ( and where appropriate the destruction) of the pcb mutex to the init/finit functions of the pcb zones. This allows locking of the pcb entries and race condition free comparison of the generation count. Rearrange locking a bit to avoid extra locking operation to update the generation count in in_pcballoc(). (in_pcballoc now returns the pcb locked) I am planning to convert pcb list handling from a type safe to a reference count model soon. ( As this allows really freeing the PCBs) Reviewed by: rwatson@, mohans@ MFC after: 1 week
# 4b97d7af	29-Jun-2006	Yaroslav Tykhiy <ytykhiy@gmail.com>	There is a consensus that ifaddr.ifa_addr should never be NULL, except in places dealing with ifaddr creation or destruction; and in such special places incomplete ifaddrs should never be linked to system-wide data structures. Therefore we can eliminate all the superfluous checks for "ifa->ifa_addr != NULL" and get ready to the system crashing honestly instead of masking possible bugs. Suggested by: glebius, jhb, ru
# 4f590175	21-Apr-2006	Paul Saab <ps@FreeBSD.org>	Allow for nmbclusters and maxsockets to be increased via sysctl. An eventhandler is used to update all the various zones that depend on these values.
# a34f6c1e	03-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Correct incorrect assertion in div_bind(): inp must not be NULL here. Reported by: tegge MFC after: 3 months
# 14ba8add	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Update in_pcb-derived basic socket types following changes to pru_abort(), pru_detach(), and in_pcbdetach(): - Universally support and enforce the invariant that so_pcb is never NULL, converting dozens of unnecessary NULL checks into assertions, and eliminating dozens of unnecessary error handling cases in protocol code. - In some cases, eliminate unnecessary pcbinfo locking, as it is no longer required to ensure so_pcb != NULL. For example, in protocol shutdown methods, and in raw IP send. - Abort and detach protocol switch methods no longer return failures, nor attempt to free sockets, as the socket layer does this. - Invoke in_pcbfree() after in_pcbdetach() in order to free the detached in_pcb structure for a socket. MFC after: 3 months
# bc725eaf	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Chance protocol switch method pru_detach() so that it returns void rather than an error. Detaches do not "fail", they other occur or the protocol flags SS_PROTOREF to take ownership of the socket. soclose() no longer looks at so_pcb to see if it's NULL, relying entirely on the protocol to decide whether it's time to free the socket or not using SS_PROTOREF. so_pcb is now entirely owned and managed by the protocol code. Likewise, no longer test so_pcb in other socket functions, such as soreceive(), which have no business digging into protocol internals. Protocol detach routines no longer try to free the socket on detach, this is performed in the socket code if the protocol permits it. In rts_detach(), no longer test for rp != NULL in detach, and likewise in other protocols that don't permit a NULL so_pcb, reduce the incidence of testing for it during detach. netinet and netinet6 are not fully updated to this change, which will be in an upcoming commit. In their current state they may leak memory or panic. MFC after: 3 months
# 303989a2	09-Nov-2005	Ruslan Ermilov <ru@FreeBSD.org>	Use sparse initializers for "struct domain" and "struct protosw", so they are easier to follow for the human being.
# b3cf6808	13-May-2005	Gleb Smirnoff <glebius@FreeBSD.org>	In div_output() explicitly set m->m_nextpkt to NULL. If divert socket is not userland, but ng_ksocket, then m->m_nextpkt may be non-NULL. In this case we would panic in sbappend.
# fd94099e	05-May-2005	Colin Percival <cperciva@FreeBSD.org>	If we are going to 1. Copy a NULL-terminated string into a fixed-length buffer, and 2. copyout that buffer to userland, we really ought to 0. Zero the entire buffer first. Security: FreeBSD-SA-05:08.kmem
# c398230b	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for license, minor formatting changes
# c1384b5a	18-Nov-2004	Gleb Smirnoff <glebius@FreeBSD.org>	- Since divert protocol is not connection oriented, remove SS_ISCONNECTED flag from divert sockets. - Remove div_disconnect() method, since it shouldn't be called now. - Remove div_abort() method. It was never called directly, since protocol doesn't have listen queue. It was called only from div_disconnect(), which is removed now. Reviewed by: rwatson, maxim Approved by: julian (mentor) MT5 after: 1 week MT4 after: 1 month
# ea0bd576	12-Nov-2004	Gleb Smirnoff <glebius@FreeBSD.org>	Fix ng_ksocket(4) operation as a divert socket, which is pretty useful and has been broken twice: - in the beginning of div_output() replace KASSERT with assignment, as it was in rev. 1.83. [1] [to be MFCed] - refactor changes introduced in rev. 1.100: do not prepend a new tag unconditionally. Before doing this check whether we have one. [2] A small note for all hacking in this area: when divert socket is not a real userland, but ng_ksocket(4), we receive _the same_ mbufs, that we transmitted to socket. These mbufs have rcvif, the tags we've put on them. And we should treat them correctly. Discussed with: mlaier [1] Silence from: green [2] Reviewed by: maxim Approved by: julian (mentor) MFC after: 1 week
# e21e4c19	11-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Add missing '=' Spotted by: obrien
# 756d52a1	08-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Initialize struct pr_userreqs in new/sparse style and fill in common default elements in net_init_domain(). This makes it possible to grep these structures and see any bogosities.
# 84bb6a2e	25-Oct-2004	Andre Oppermann <andre@FreeBSD.org>	IPDIVERT is a module now and tell the other parts of the kernel about it. IPDIVERT depends on IPFIREWALL being loaded or compiled into the kernel.
# 24fc79b0	22-Oct-2004	Andre Oppermann <andre@FreeBSD.org>	Refuse to unload the ipdivert module unless the 'force' flag is given to kldunload. Reflect the fact that IPDIVERT is a loadable module in the divert(4) and ipfw(8) man pages.
# 57bbe2e1	19-Oct-2004	Andre Oppermann <andre@FreeBSD.org>	Destroy the UMA zone on unload.
# 2de1a9eb	19-Oct-2004	Andre Oppermann <andre@FreeBSD.org>	Slightly extend the locking during unload to fully cover the protocol deregistration. This does not entirely close the race but narrows the even previously extremely small chance of a race some more.
# 279128e2	19-Oct-2004	Robert Watson <rwatson@FreeBSD.org>	Annotate a newly introduced race present due to the unloading of protocols: it is possible for sockets to be created and attached to the divert protocol between the test for sockets present and successful unload of the registration handler. We will need to explore more mature APIs for unregistering the protocol and then draining consumers, or an atomic test-and-unregister mechanism.
# 72584fd2	19-Oct-2004	Andre Oppermann <andre@FreeBSD.org>	Convert IPDIVERT into a loadable module. This makes use of the dynamic loadability of protocols. The call to divert_packet() is done through a function pointer. All semantics of IPDIVERT remain intact. If IPDIVERT is not loaded ipfw will refuse to install divert rules and natd will complain about 'protocol not supported'. Once it is loaded both will work and accept rules and open the divert socket. The module can only be unloaded if no divert sockets are open. It does not close any divert sockets when an unload is requested but will return EBUSY instead.
# 6daf7ebd	02-Oct-2004	Brian Feldman <green@FreeBSD.org>	Add support to IPFW for classification based on "diverted" status (that is, input via a divert socket).
# b5d47ff5	04-Sep-2004	John-Mark Gurney <jmg@FreeBSD.org>	fix up socket/ip layer violation... don't assume/know that SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...
# 9b932e9e	17-Aug-2004	Andre Oppermann <andre@FreeBSD.org>	Convert ipfw to use PFIL_HOOKS. This is change is transparent to userland and preserves the ipfw ABI. The ipfw core packet inspection and filtering functions have not been changed, only how ipfw is invoked is different. However there are many changes how ipfw is and its add-on's are handled: In general ipfw is now called through the PFIL_HOOKS and most associated magic, that was in ip_input() or ip_output() previously, is now done in ipfw_check_[in\|out]() in the ipfw PFIL handler. IPDIVERT is entirely handled within the ipfw PFIL handlers. A packet to be diverted is checked if it is fragmented, if yes, ip_reass() gets in for reassembly. If not, or all fragments arrived and the packet is complete, divert_packet is called directly. For 'tee' no reassembly attempt is made and a copy of the packet is sent to the divert socket unmodified. The original packet continues its way through ip_input/output(). ipfw 'forward' is done via m_tag's. The ipfw PFIL handlers tag the packet with the new destination sockaddr_in. A check if the new destination is a local IP address is made and the m_flags are set appropriately. ip_input() and ip_output() have some more work to do here. For ip_input() the m_flags are checked and a packet for us is directly sent to the 'ours' section for further processing. Destination changes on the input path are only tagged and the 'srcrt' flag to ip_forward() is set to disable destination checks and ICMP replies at this stage. The tag is going to be handled on output. ip_output() again checks for m_flags and the 'ours' tag. If found, the packet will be dropped back to the IP netisr where it is going to be picked up by ip_input() again and the directly sent to the 'ours' section. When only the destination changes, the route's 'dst' is overwritten with the new destination from the forward m_tag. Then it jumps back at the route lookup again and skips the firewall check because it has been marked with M_SKIP_FIREWALL. ipfw 'forward' has to be compiled into the kernel with 'option IPFIREWALL_FORWARD' to enable it. DUMMYNET is entirely handled within the ipfw PFIL handlers. A packet for a dummynet pipe or queue is directly sent to dummynet_io(). Dummynet will then inject it back into ip_input/ip_output() after it has served its time. Dummynet packets are tagged and will continue from the next rule when they hit the ipfw PFIL handlers again after re-injection. BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as they did before. Later this will be changed to dedicated ETHER PFIL_HOOKS. More detailed changes to the code: conf/files Add netinet/ip_fw_pfil.c. conf/options Add IPFIREWALL_FORWARD option. modules/ipfw/Makefile Add ip_fw_pfil.c. net/bridge.c Disable PFIL_HOOKS if ipfw for bridging is active. Bridging ipfw is still directly invoked to handle layer2 headers and packets would get a double ipfw when run through PFIL_HOOKS as well. netinet/ip_divert.c Removed divert_clone() function. It is no longer used. netinet/ip_dummynet.[ch] Neither the route 'ro' nor the destination 'dst' need to be stored while in dummynet transit. Structure members and associated macros are removed. netinet/ip_fastfwd.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. netinet/ip_fw.h Removed 'ro' and 'dst' from struct ip_fw_args. netinet/ip_fw2.c (Re)moved some global variables and the module handling. netinet/ip_fw_pfil.c New file containing the ipfw PFIL handlers and module initialization. netinet/ip_input.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. ip_forward() does not longer require the 'next_hop' struct sockaddr_in argument. Disable early checks if 'srcrt' is set. netinet/ip_output.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. netinet/ip_var.h Add ip_reass() as general function. (Used from ipfw PFIL handlers for IPDIVERT.) netinet/raw_ip.c Directly check if ipfw and dummynet control pointers are active. netinet/tcp_input.c Rework the 'ipfw forward' to local code to work with the new way of forward tags. netinet/tcp_sack.c Remove include 'opt_ipfw.h' which is not needed here. sys/mbuf.h Remove m_claim_next() macro which was exclusively for ipfw 'forward' and is no longer needed. Approved by: re (scottl)
# 420a2811	11-Aug-2004	Andre Oppermann <andre@FreeBSD.org>	Backout removal of UMA_ZONE_NOFREE flag for all zones which are established for structures with timers in them. It might be that a timer might fire even when the associated structure has already been free'd. Having type- stable storage in this case is beneficial for graceful failure handling and debugging. Discussed with: bosko, tegge, rwatson
# 4efb805c	11-Aug-2004	Andre Oppermann <andre@FreeBSD.org>	Remove the UMA_ZONE_NOFREE flag to all uma_zcreate() calls in the IP and TCP code. This flag would have prevented giving back excessive free slabs to the global pool after a transient peak usage.
# f0cada84	02-Aug-2004	Andre Oppermann <andre@FreeBSD.org>	o Move all parts of the IP reassembly process into the function ip_reass() to make it fully self-contained. o ip_reass() now returns a new mbuf with the reassembled packet and ip->ip_len including the IP header. o Computation of the delayed checksum is moved into divert_packet(). Reviewed by: silby
# e3e244bf	27-Jun-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Rwatson, write 100 times for tomorrow: First unlock, then assign NULL to pointer.
# 1e4d7da7	26-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Reduce the number of unnecessary unlock-relocks on socket buffer mutexes associated with performing a wakeup on the socket buffer: - When performing an sbappend*() followed by a so[rw]wakeup(), explicitly acquire the socket buffer lock and use the _locked() variants of both calls. Note that the _locked() sowakeup() versions unlock the mutex on return. This is done in uipc_send(), divert_packet(), mroute socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append(). - When the socket buffer lock is dropped before a sowakeup(), remove the explicit unlock and use the _locked() sowakeup() variant. This is done in soisdisconnecting(), soisdisconnected() when setting the can't send/ receive flags and dropping data, and in uipc_rcvd() which adjusting back-pressure on the sockets. For UNIX domain sockets running mpsafe with a contention-intensive SMP mysql benchmark, this results in a 1.6% query rate improvement due to reduce mutex costs.
# bb7479a6	21-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Acquire socket lock around frobbing of socket state in divert sockets.
# ffcbc0e4	21-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Prefer use of the inpcb as a MAC label source for outgoing packets sent via divert sockets, when available.
# 310e7ceb	12-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Socket MAC labels so_label and so_peerlabel are now protected by SOCK_LOCK(so): - Hold socket lock over calls to MAC entry points reading or manipulating socket labels. - Assert socket lock in MAC entry point implementations. - When externalizing the socket label, first make a thread-local copy while holding the socket lock, then release the socket lock to externalize to userspace.
# c1d587c8	10-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Remove unneeded Giant acquisition in divert_packet(), which is left over from debug.mpsafenet affecting only the forwarding plane. Giant is now acquired in the ithread/netisr or in the system call code.
# f36cfd49	07-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson
# b0330ed9	27-Mar-2004	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Reduce 'td' argument to 'cred' (struct ucred) argument in those functions: - in_pcbbind(), - in_pcbbind_setup(), - in_pcbconnect(), - in_pcbconnect_setup(), - in6_pcbbind(), - in6_pcbconnect(), - in6_pcbsetport(). "It should simplify/clarify things a great deal." --rwatson Requested by: rwatson Reviewed by: rwatson, ume
# 6823b823	27-Mar-2004	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Remove unused argument. Reviewed by: ume
# 47934cef	25-Feb-2004	Don Lewis <truckman@FreeBSD.org>	Split the mlock() kernel code into two parts, mlock(), which unpacks the syscall arguments and does the suser() permission check, and kern_mlock(), which does the resource limit checking and calls vm_map_wire(). Split munlock() in a similar way. Enable the RLIMIT_MEMLOCK checking code in kern_mlock(). Replace calls to vslock() and vsunlock() in the sysctl code with calls to kern_mlock() and kern_munlock() so that the sysctl code will obey the wired memory limits. Nuke the vslock() and vsunlock() implementations, which are no longer used. Add a member to struct sysctl_req to track the amount of memory that is wired to handle the request. Modify sysctl_wire_old_buffer() to return an error if its call to kern_mlock() fails. Only wire the minimum of the length specified in the sysctl request and the length specified in its argument list. It is recommended that sysctl handlers that use sysctl_wire_old_buffer() should specify reasonable estimates for the amount of data they want to return so that only the minimum amount of memory is wired no matter what length has been specified by the request. Modify the callers of sysctl_wire_old_buffer() to look for the error return. Modify sysctl_old_user to obey the wired buffer length and clean up its implementation. Reviewed by: bms
# ac9d7e26	25-Feb-2004	Max Laier <mlaier@FreeBSD.org>	Re-remove MT_TAGs. The problems with dummynet have been fixed now. Tested by: -current, bms(mentor), me Approved by: bms(mentor), sam
# 36e8826f	17-Feb-2004	Max Laier <mlaier@FreeBSD.org>	Backout MT_TAG removal (i.e. bring back MT_TAGs) for now, as dummynet is not working properly with the patch in place. Approved by: bms(mentor)
# 1094bdca	13-Feb-2004	Max Laier <mlaier@FreeBSD.org>	This set of changes eliminates the use of MT_TAG "pseudo mbufs", replacing them mostly with packet tags (one case is handled by using an mbuf flag since the linkage between "caller" and "callee" is direct and there's no need to incur the overhead of a packet tag). This is (mostly) work from: sam Silence from: -arch Approved by: bms(mentor), sam, rwatson
# 5bd311a5	25-Nov-2003	Sam Leffler <sam@FreeBSD.org>	Split the "inp" mutex class into separate classes for each of divert, raw, tcp, udp, raw6, and udp6 sockets to avoid spurious witness complaints. Reviewed by: rwatson Approved by: re (rwatson)
# 97d8d152	20-Nov-2003	Andre Oppermann <andre@FreeBSD.org>	Introduce tcp_hostcache and remove the tcp specific metrics from the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)
# a557af22	17-Nov-2003	Robert Watson <rwatson@FreeBSD.org>	Introduce a MAC label reference in 'struct inpcb', which caches the MAC label referenced from 'struct socket' in the IPv4 and IPv6-based protocols. This permits MAC labels to be checked during network delivery operations without dereferencing inp->inp_socket to get to so->so_label, which will eventually avoid our having to grab the socket lock during delivery at the network layer. This change introduces 'struct inpcb' as a labeled object to the MAC Framework, along with the normal circus of entry points: initialization, creation from socket, destruction, as well as a delivery access control check. For most policies, the inpcb label will simply be a cache of the socket label, so a new protocol switch method is introduced, pr_sosetlabel() to notify protocols that the socket layer label has been updated so that the cache can be updated while holding appropriate locks. Most protocols implement this using pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use the the worker function in_pcbsosetlabel(), which calls into the MAC Framework to perform a cache update. Biba, LOMAC, and MLS implement these entry points, as do the stub policy, and test policy. Reviewed by: sam, bms Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 63346129	16-Nov-2003	Brian Feldman <green@FreeBSD.org>	Fix a few cases where MT_TAG-type "fake mbufs" are created on the stack, but do not have mh_nextpkt initialized. Somtimes what's there is "1", and the ip_input() code pukes trying to m_free() it, rendering divert sockets and such broken. This really underscores the need to get rid of MT_TAG. Reviewed by: rwatson
# 252f24a2	08-Nov-2003	Sam Leffler <sam@FreeBSD.org>	divert socket fixups: o pickup Giant in divert_packet to protect sbappendaddr since it can be entered through MPSAFE callouts or through ip_input when mpsafenet is 1 o add missing locking on output o add locking to abort and shutdown o add a ctlinput handler to invalidate held routing table references on an ICMP redirect (may not be needed) Supported by: FreeBSD Foundation
# 9bf40ede	31-Oct-2003	Brooks Davis <brooks@FreeBSD.org>	Replace the if_name and if_unit members of struct ifnet with new members if_xname, if_dname, and if_dunit. if_xname is the name of the interface and if_dname/unit are the driver name and instance. This change paves the way for interface renaming and enhanced pseudo device creation and configuration symantics. Approved By: re (in principle) Reviewed By: njl, imp Tested On: i386, amd64, sparc64 Obtained From: NetBSD (if_xname)
# 26f91065	04-Sep-2003	Sam Leffler <sam@FreeBSD.org>	o add locking o move the global divsrc socket address to a local variable instead of locking it Sponsored by: FreeBSD Foundation
# fe584538	08-Apr-2003	Dag-Erling Smørgrav <des@FreeBSD.org>	Introduce an M_ASSERTPKTHDR() macro which performs the very common task of asserting that an mbuf has a packet header. Use it instead of hand- rolled versions wherever applicable. Submitted by: Hiten Pandya <hiten@unixdaemons.com>
# a163d034	18-Feb-2003	Warner Losh <imp@FreeBSD.org>	Back out M_* changes, per decision of the TRB. Approved by: trb
# 4ee6e70e	28-Jan-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Check bounds for index before dereferencing memory past end of array. Found by: FlexeLint
# 44956c98	21-Jan-2003	Alfred Perlstein <alfred@FreeBSD.org>	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
# 4d3ffc98	29-Oct-2002	Bill Fenner <fenner@FreeBSD.org>	Renumber IPPROTO_DIVERT out of the range of valid IP protocol numbers. This allows socket() to return an error when the kernel is not built with IPDIVERT, and doesn't prevent future applications from using the "borrowed" IP protocol number. The sysctl net.inet.raw.olddiverterror controls whether opening a socket with the "borrowed" IP protocol fails with an accompanying kernel printf; this code should last only a couple of releases. Approved by: re
# 56e77afa	24-Oct-2002	Maxime Henrion <mux@FreeBSD.org>	Fix kernel build on sparc64 in the IPDIVERT case.
# 5d846453	15-Oct-2002	Sam Leffler <sam@FreeBSD.org>	Replace aux mbufs with packet tags: o instead of a list of mbufs use a list of m_tag structures a la openbsd o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit ABI/module number cookie o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and use this in defining openbsd-compatible m_tag_find and m_tag_get routines o rewrite KAME use of aux mbufs in terms of packet tags o eliminate the most heavily used aux mbufs by adding an additional struct inpcb parameter to ip_output and ip6_output to allow the IPsec code to locate the security policy to apply to outbound packets o bump __FreeBSD_version so code can be conditionalized o fixup ipfilter's call to ip_output based on __FreeBSD_version Reviewed by: julian, luigi (silent), -arch, -net, darren Approved by: julian, silence from everyone else Obtained from: openbsd (mostly) MFC after: 1 month
# d3990b06	31-Jul-2002	Robert Watson <rwatson@FreeBSD.org>	Introduce support for Mandatory Access Control and extensible kernel access control. Invoke the MAC framework to label mbuf created using divert sockets. These labels may later be used for access control on delivery to another socket, or to an interface. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI LAbs
# a5924d61	23-Jun-2002	Luigi Rizzo <luigi@FreeBSD.org>	fix a typo in a comment
# 2b25acc1	22-Jun-2002	Luigi Rizzo <luigi@FreeBSD.org>	Remove (almost all) global variables that were used to hold packet forwarding state ("annotations") during ip processing. The code is considerably cleaner now. The variables removed by this change are: ip_divert_cookie used by divert sockets ip_fw_fwd_addr used for transparent ip redirection last_pkt used by dynamic pipes in dummynet Removal of the first two has been done by carrying the annotations into volatile structs prepended to the mbuf chains, and adding appropriate code to add/remove annotations in the routines which make use of them, i.e. ip_input(), ip_output(), tcp_input(), bdg_forward(), ether_demux(), ether_output_frame(), div_output(). On passing, remove a bug in divert handling of fragmented packet. Now it is the fragment at offset 0 which sets the divert status of the whole packet, whereas formerly it was the last incoming fragment to decide. Removal of last_pkt required a change in the interface of ip_fw_chk() and dummynet_io(). On passing, use the same mechanism for dummynet annotations and for divert/forward annotations. option IPFIREWALL_FORWARD is effectively useless, the code to implement it is very small and is now in by default to avoid the obfuscation of conditionally compiled code. NOTES: * there is at least one global variable left, sro_fwd, in ip_output(). I am not sure if/how this can be removed. * I have deliberately avoided gratuitous style changes in this commit to avoid cluttering the diffs. Minor stule cleanup will likely be necessary * this commit only focused on the IP layer. I am sure there is a number of global variables used in the TCP and maybe UDP stack. * despite the number of files touched, there are absolutely no API's or data structures changed by this commit (except the interfaces of ip_fw_chk() and dummynet_io(), which are internal anyways), so an MFC is quite safe and unintrusive (and desirable, given the improved readability of the code). MFC after: 10 days
# 7a9378e7	11-Jun-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Remember to initialize the control block head mutex.
# 3d9baf34	11-Jun-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Fix typo. Submitted by: Kyunghwan Kim <redjade@atropos.snu.ac.kr>
# f76fcf6d	10-Jun-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Lock up inpcb. Submitted by: Jennifer Yang <yangjihui@yahoo.com>
# 4cc20ab1	31-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Back out my lats commit of locking down a socket, it conflicts with hsu's work. Requested by: hsu
# 243917fe	19-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Lock down a socket, milestone 1. o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket. o Determine the lock strategy for each members in struct socket. o Lock down the following members: - so_count - so_options - so_linger - so_state o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket: - sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup() Reviewed by: alfred
# 960ed29c	29-Apr-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Revert the change of #includes in sys/filedesc.h and sys/socketvar.h. Requested by: bde Since locking sigio_lock is usually followed by calling pgsigio(), move the declaration of sigio_lock and the definitions of SIGIO_*() to sys/signalvar.h. While I am here, sort include files alphabetically, where possible.
# ad278afd	09-Apr-2002	John Baldwin <jhb@FreeBSD.org>	Change the first argument of prison_xinpcb() to be a thread pointer instead of a proc pointer so that prison_xinpcb() can use td_ucred.
# 44731cab	01-Apr-2002	John Baldwin <jhb@FreeBSD.org>	Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag. Discussed on: smp@
# 69c2d429	19-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	Switch vm_zone.h with uma.h. Change over to uma interfaces.
# fd8e4ebc	18-Feb-2002	Mike Barcroft <mike@FreeBSD.org>	o Move NTOHL() and associated macros into <sys/param.h>. These are deprecated in favor of the POSIX-defined lowercase variants. o Change all occurrences of NTOHL() and associated marcros in the source tree to use the lowercase function variants. o Add missing license bits to sparc64's <machine/endian.h>. Approved by: jake o Clean up <machine/endian.h> files. o Remove unused __uint16_swap_uint32() from i386's <machine/endian.h>. o Remove prototypes for non-existent bswapXX() functions. o Include <machine/endian.h> in <arpa/inet.h> to define the POSIX-required ntohl() family of functions. o Do similar things to expose the ntohl() family in libstand, <netinet/in.h>, and <sys/param.h>. o Prepend underscores to the ntohl() family to help deal with complexities associated with having MD (asm and inline) versions, and having to prevent exposure of these functions in other headers that happen to make use of endian-specific defines. o Create weak aliases to the canonical function name to help deal with third-party software forgetting to include an appropriate header. o Remove some now unneeded pollution from <sys/types.h>. o Add missing <arpa/inet.h> includes in userland. Tested on: alpha, i386 Reviewed by: bde, jake, tmm
# 6e551fb6	10-Dec-2001	David E. O'Brien <obrien@FreeBSD.org>	Update to C99, s/__FUNCTION__/__func__/, also don't use ANSI string concatenation.
# ce178806	07-Nov-2001	Robert Watson <rwatson@FreeBSD.org>	o Replace reference to 'struct proc' with 'struct thread' in 'struct sysctl_req', which describes in-progress sysctl requests. This permits sysctl handlers to have access to the current thread, permitting work on implementing td->td_ucred, migration of suser() to using struct thread to derive the appropriate ucred, and allowing struct thread to be passed down to other code, such as network code where td is not currently available (and curproc is used). o Note: netncp and netsmb are not updated to reflect this change, as they are not currently KSE-adapted. Reviewed by: julian Obtained from: TrustedBSD Project
# b40ce416	12-Sep-2001	Julian Elischer <julian@FreeBSD.org>	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# f0ffb944	03-Sep-2001	Julian Elischer <julian@FreeBSD.org>	Patches from Keiichi SHIMA <keiichi@iij.ad.jp> to make ip use the standard protosw structure again. Obtained from: Well, KAME I guess.
# 13cf67f3	26-Jul-2001	Hajimu UMEMOTO <ume@FreeBSD.org>	move ipsec security policy allocation into in_pcballoc, before making pcbs available to the outside world. otherwise, we will see inpcb without ipsec security policy attached (-> panic() in ipsec.c). Obtained from: KAME MFC after: 3 days
# fc2ffbe6	04-Feb-2001	Poul-Henning Kamp <phk@FreeBSD.org>	Mechanical change to use <sys/queue.h> macro API instead of fondling implementation details. Created with: sed(1) Reviewed by: md5(1)
# cf9fa8e7	29-Oct-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Move suser() and suser_xxx() prototypes and a related #define from <sys/proc.h> to <sys/systm.h>. Correctly document the #includes needed in the manpage. Add one now needed #include of <sys/systm.h>. Remove the consequent 48 unused #includes of <sys/proc.h>.
# e30177e0	14-Sep-2000	Ruslan Ermilov <ru@FreeBSD.org>	Follow BSD/OS and NetBSD, keep the ip_id field in network order all the time. Requested by: wollman
# 04287599	31-Aug-2000	Ruslan Ermilov <ru@FreeBSD.org>	Fixed broken ICMP error generation, unified conversion of IP header fields between host and network byte order. The details: o icmp_error() now does not add IP header length. This fixes the problem when icmp_error() is called from ip_forward(). In this case the ip_len of the original IP datagram returned with ICMP error was wrong. o icmp_error() expects all three fields, ip_len, ip_id and ip_off in host byte order, so DTRT and convert these fields back to network byte order before sending a message. This fixes the problem described in PR 16240 and PR 20877 (ip_id field was returned in host byte order). o ip_ttl decrement operation in ip_forward() was moved down to make sure that it does not corrupt the copy of original IP datagram passed later to icmp_error(). o A copy of original IP datagram in ip_forward() was made a read-write, independent copy. This fixes the problem I first reported to Garrett Wollman and Bill Fenner and later put in audit trail of PR 16240: ip_output() (not always) converts fields of original datagram to network byte order, but because copy (mcopy) and its original (m) most likely share the same mbuf cluster, ip_output()'s manipulations on original also corrupted the copy. o ip_output() now expects all three fields, ip_len, ip_off and (what is significant) ip_id in host byte order. It was a headache for years that ip_id was handled differently. The only compatibility issue here is the raw IP socket interface with IP_HDRINCL socket option set and a non-zero ip_id field, but ip.4 manual page was unclear on whether in this case ip_id field should be in host or network byte order.
# 3e065e76	30-Aug-2000	Ruslan Ermilov <ru@FreeBSD.org>	Fixed the bug that div_bind() always returned zero even if there was an error (broken in rev 1.9).
# cec335f9	03-Aug-2000	Ruslan Ermilov <ru@FreeBSD.org>	Make netstat(1) to be aware of divert(4) sockets.
# 7a04c4f8	02-May-2000	Paul Richards <paul@FreeBSD.org>	Force the address of the socket to be INADDR_ANY immediately before calling in_pcbbind so that in_pcbbind sees a valid address if no address was specified (since divert sockets ignore them). PR: 17552 Reviewed by: Brian
# 0ba9128b	07-Jan-2000	Yoshinobu Inoue <shin@FreeBSD.org>	prevent kernel panic which happens when either of IPSEC and IPDIVERT is enabled. Confirmed by: Eugene M. Kim <ab@astralblue.com>
# 6a800098	22-Dec-1999	Yoshinobu Inoue <shin@FreeBSD.org>	IPSEC support in the kernel. pr_input() routines prototype is also changed to support IPSEC and IPV6 chained protocol headers. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 8948e4ba	05-Dec-1999	Archie Cobbs <archie@FreeBSD.org>	Miscellaneous fixes/cleanups relating to ipfw and divert(4): - Implement 'ipfw tee' (finally) - Divert packets by calling new function divert_packet() directly instead of going through protosw[]. - Replace kludgey global variable 'ip_divert_port' with a function parameter to divert_packet() - Replace kludgey global variable 'frag_divert_port' with a function parameter to ip_reass() - style(9) fixes Reviewed by: julian, green
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# f711d546	27-Apr-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Suser() simplification: 1: s/suser/suser_xxx/ 2: Add new function: suser(struct proc ), prototyped in <sys/proc.h>. 3: s/suser_xxx($[a-zA-Z0-9_]$->p_ucred, \&\1->p_acflag)/suser(\1)/ The remaining suser_xxx() calls will be scrutinized and dealt with later. There may be some unneeded #include <sys/cred.h>, but they are left as an exercise for Bruce. More changes to the suser() API will come along with the "jail" code.
# a0c091ad	07-Feb-1999	Julian Elischer <julian@FreeBSD.org>	remove leftover garbage line.
# b0935ca2	07-Feb-1999	Julian Elischer <julian@FreeBSD.org>	Fix for PR 9309. Divert was not feeding clean data to ifa_ifwithaddr() so it was giving bad results. Submitted by: kseel <kseel@utcorp.com>, Ruslan Ermilov <ru@ucb.crimea.ua>
# 2127f260	04-Dec-1998	Archie Cobbs <archie@FreeBSD.org>	Examine all occurrences of sprintf(), strcat(), and str[n]cpy() for possible buffer overflow problems. Replaced most sprintf()'s with snprintf(); for others cases, added terminating NUL bytes where appropriate, replaced constants like "16" with sizeof(), etc. These changes include several bug fixes, but most changes are for maintainability's sake. Any instance where it wasn't "immediately obvious" that a buffer overflow could not occur was made safer. Reviewed by: Bruce Evans <bde@zeta.org.au> Reviewed by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: Mike Spengler <mks@networkcs.com>
# efe39c6a	06-Jul-1998	Julian Elischer <julian@FreeBSD.org>	Bring back some slight cleanups from 2.2
# 7d82bea5	02-Jul-1998	Julian Elischer <julian@FreeBSD.org>	Remove out of date comment.
# b3adeeb2	01-Jul-1998	Julian Elischer <julian@FreeBSD.org>	Remove the option to keep IPFW diversion backwards compatible WRT diversion reinjection. No-one has been bitten by the new behaviour that I know of.
# 0cab7536	11-Jun-1998	Julian Elischer <julian@FreeBSD.org>	include opt_ipdivert.h so we get correct options
# bab04eb8	11-Jun-1998	Julian Elischer <julian@FreeBSD.org>	Allow diverted packets from the transmit side to remember if they had a recv interface and allow that state to be available after re-injection for further tests.
# 3ed81d03	06-Jun-1998	Julian Elischer <julian@FreeBSD.org>	Fix wrong data type for a pointer.
# c977d4c7	06-Jun-1998	Julian Elischer <julian@FreeBSD.org>	clean up the changes made to ipfw over the last weeks (should make the ipfw lkm work again)
# e256a933	05-Jun-1998	Julian Elischer <julian@FreeBSD.org>	Reverse the default sense of the IPFW/DIVERT reinjection code so that the new behaviour is now default. Solves the "infinite loop in diversion" problem when more than one diversion is active. Man page changes follow. The new code is in -stable as the NON default option.
# bb60f459	25-May-1998	Julian Elischer <julian@FreeBSD.org>	Add optional code to change the way that divert and ipfw work together. Prior to this change, Accidental recursion protection was done by the diverted daemon feeding back the divert port number it got the packet on, as the port number on a sendto(). IPFW knew not to redivert a packet to this port (again). Processing of the ruleset started at the beginning again, skipping that divert port. The new semantic (which is how we should have done it the first time) is that the port number in the sendto() is the rule number AFTER which processing should restart, and on a recvfrom(), the port number is the rule number which caused the diversion. This is much more flexible, and also more intuitive. If the user uses the same sockaddr received when resending, processing resumes at the rule number following that that caused the diversion. The user can however select to resume rule processing at any rule. (0 is restart at the beginning) To enable the new code use option IPFW_DIVERT_RESTART This should become the default as soon as people have looked at it a bit
# 436c7212	25-May-1998	Julian Elischer <julian@FreeBSD.org>	Hide the interface name in the sin_zero section of the sockaddr_in passed to the user process for incoming packets. When the sockaddr_in is passed back to the divert socket later, use thi sas the primary interface lookup and only revert to the IP address when the name fails. This solves a long standing bug with divert sockets: When two interfaces had the same address (P2P for example) the interface "assigned" to the reinjected packet was sometimes incorect. Probably we should define a "sockaddr_div" to officially hold this extended information in teh same manner as sockaddr_dl.
# 25e75fb3	25-May-1998	Julian Elischer <julian@FreeBSD.org>	Take the user's "IGNORE_DIVERT" argument from where the user put it and not from the PCB which HAPPENS to contain the same number most of the time, but not always.
# 98271db4	15-May-1998	Garrett Wollman <wollman@FreeBSD.org>	Convert socket structures to be type-stable and add a version number. Define a parameter which indicates the maximum number of sockets in a system, and use this to size the zone allocators used for sockets and for certain PCBs. Convert PF_LOCAL PCB structures to be type-stable and add a version number. Define an external format for infomation about socket structures and use it in several places. Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through sysctl(3) without blocking network interrupts for an unreasonable length of time. This probably still has some bugs and/or race conditions, but it seems to work well enough on my machines. It is now possible for `netstat' to get almost all of its information via the sysctl(3) interface rather than reading kmem (changes to follow).
# 8781d8e9	28-Mar-1998	Bruce Evans <bde@FreeBSD.org>	Fixed style bugs (mostly) in previous commit.
# 3d4d47f3	24-Mar-1998	Garrett Wollman <wollman@FreeBSD.org>	Use the zone allocator to allocate inpcbs and tcpcbs. Each protocol creates its own zone; this is used particularly by TCP which allocates both inpcb and tcpcb in a single allocation. (Some hackery ensures that the tcpcb is reasonably aligned.) Also keep track of the number of pcbs of each type allocated, and keep a generation count (instance version number) for future use.
# 0b08f5f7	05-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Back out DIAGNOSTIC changes.
# 47cfdb16	04-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Turn DIAGNOSTIC into a new-style option.
# c3229e05	27-Jan-1998	David Greenman <dg@FreeBSD.org>	Improved connection establishment performance by doing local port lookups via a hashed port list. In the new scheme, in_pcblookup() goes away and is replaced by a new routine, in_pcblookup_local() for doing the local port check. Note that this implementation is space inefficient in that the PCB struct is now too large to fit into 128 bytes. I might deal with this in the future by using the new zone allocator, but I wanted these changes to be extensively tested in their current form first. Also: 1) Fixed off-by-one errors in the port lookup loops in in_pcbbind(). 2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash() to do the initialial hash insertion. 3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability. 4) Added a new routine, in_pcbremlists() to remove the PCB from the various hash lists. 5) Added/deleted comments where appropriate. 6) Removed unnecessary splnet() locking. In general, the PCB functions should be called at splnet()...there are unfortunately a few exceptions, however. 7) Reorganized a few structs for better cache line behavior. 8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in the future, however. These changes have been tested on wcarchive for more than a month. In tests done here, connection establishment overhead is reduced by more than 50 times, thus getting rid of one of the major networking scalability problems. Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult. WARNING: Anything that knows about inpcb and tcpcb structs will have to be recompiled; at the very least, this includes netstat(1).
# 1d5e9e22	08-Jan-1998	Eivind Eklund <eivind@FreeBSD.org>	Make INET a proper option. This will not make any of object files that LINT create change; there might be differences with INET disabled, but hardly anything compiled before without INET anyway. Now the 'obvious' things will give a proper error if compiled without inet - ipx_ip, ipfw, tcp_debug. The only thing that _should_ work (but can't be made to compile reasonably easily) is sppp :-( This commit move struct arpcom from <netinet/if_ether.h> to <net/if_arp.h>.
# 86b3ebce	18-Dec-1997	David Greenman <dg@FreeBSD.org>	Call in_pcballoc() at splnet(). As near as I can tell, this won't fix any instability problems, but it was wrong nonetheless and will be required in an upcoming round of PCB changes.
# f8f6cbba	13-Sep-1997	Peter Wemm <peter@FreeBSD.org>	Update network code to use poll support.
# 5bfe67ef	13-Sep-1997	Peter Wemm <peter@FreeBSD.org>	Some mbuf -> sockaddr changes seem to have been missed here.
# 1fd0b058	02-Aug-1997	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# e4676ba6	01-Jun-1997	Julian Elischer <julian@FreeBSD.org>	Submitted by: Whistle Communications (archie Cobbs) these are quite extensive additions to the ipfw code. they include a change to the API because the old method was broken, but the user view is kept the same. The new code allows a particular match to skip forward to a particular line number, so that blocks of rules can be used without checking all the intervening rules. There are also many more ways of rejecting connections especially TCP related, and many many more ... see the man page for a complete description.
# b34db546	01-Jun-1997	Peter Wemm <peter@FreeBSD.org>	typo fix, s/imp/inp'; move lookup call inside splnet since there were comments on it being outside.
# 159fe49b	25-May-1997	Peter Wemm <peter@FreeBSD.org>	Uninitialised inp variable in div_bind(). Submitted by: Åge Røbekk <aagero@aage.priv.no>
# 9f907986	24-May-1997	Peter Wemm <peter@FreeBSD.org>	Attempt to convert the ip_divert code to use the new-style protocol request switch. I needed 'LINT' to compile for other reasons so I kinda got the blood on my hands. Note: I don't know how to test this, I don't know if it works correctly.
# ca98b82c	02-Apr-1997	David Greenman <dg@FreeBSD.org>	Reorganize elements of the inpcb struct to take better advantage of cache lines. Removed the struct ip proto since only a couple of chars were actually being used in it. Changed the order of compares in the PCB hash lookup to take advantage of partial cache line fills (on PPro). Discussed-with: wollman
# ddd79a97	03-Mar-1997	David Greenman <dg@FreeBSD.org>	Improved performance of hash algorithm while (hopefully) not reducing the quality of the hash distribution. This does not fix a problem dealing with poor distribution when using lots of IP aliases and listening on the same port on every one of them...some other day perhaps; fixing that requires significant code changes. The use of xor was inspired by David S. Miller <davem@jenolan.rutgers.edu>
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 7e05e70c	20-Feb-1997	Garrett Wollman <wollman@FreeBSD.org>	Fix the parameters of a call to in_setsockaddr().
# d81e4043	02-Feb-1997	Brian Somers <brian@FreeBSD.org>	Reset ip_divert_ignore to zero immediately after use - also, set it in the first place, independent of whether sin->sin_port is set. The result is that diverted packets that are being forwarded will be diverted once and only once on the way in (ip_input()) and again, once and only once on the way out (ip_output()) - twice in total. ICMP packets that don't contain a port will now also be diverted.
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 59562606	13-Dec-1996	Garrett Wollman <wollman@FreeBSD.org>	Convert the interface address and IP interface address structures to TAILQs. Fix places which referenced these for no good reason that I can see (the references remain, but were fixed to compile again; they are still questionable).
# 93e0e116	10-Jul-1996	Julian Elischer <julian@FreeBSD.org>	Adding changes to ipfw and the kernel to support ip packet diversion.. This stuff should not be too destructive if the IPDIVERT is not compiled in.. be aware that this changes the size of the ip_fw struct so ipfw needs to be recompiled to use it.. more changes coming to clean this up.