Cross Reference: /freebsd-current/sys/kern/uipc

History log of /freebsd-current/sys/kern/uipc_socket.c
Revision	Date	Author	Comments
# 99b0270a	06-May-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: hide socket hhook(9)s under SOCKET_HHOOK There are no in-tree consumers of these hooks. Reviewed by: stevek Differential Revision: https://reviews.freebsd.org/D44928
# a8acc2bf	23-Apr-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: inherit SO_ACCEPTFILTER from listener to child This is crucial for operation of accept_filter(9). See added comment. Fixes: d29b95ecc0d049406d27a6c11939d40a46658733
# 81b4d1c4	08-Apr-2024	Stephen J. Kiernan <stevek@FreeBSD.org>	sockets: Add hhook in sonewconn for inheriting OSD specific data Added HHOOK_SOCKET_NEWCONN and bumped HHOOK_SOCKET_LAST Reviewed by: glebius, tuexen Obtained from: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D44632
# aa32d7cb	06-Apr-2024	Jake Freeland <jfree@FreeBSD.org>	ktrace: Record socket violations with KTR_CAPFAIL Report restricted access to socket addresses and protocols while Capsicum violation tracing with CAPFAIL_ADDR and CAPFAIL_PROTO. Reviewed by: markj Approved by: markj (mentor) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D40681
# 681711b7	05-Apr-2024	Michael Tuexen <tuexen@FreeBSD.org>	uipc_socket: handle socket buffer locks in sopeeloff PR: 278171 Reviewed by: markj Fixes: a4fc41423f7d ("sockets: enable protocol specific socket buffers") MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D44640
# 15bfd7cf	22-Mar-2024	Gleb Smirnoff <glebius@FreeBSD.org>	soreceive_dgram: use M_WAITOK when we don't hold any locks
# 26389b30	22-Mar-2024	Gleb Smirnoff <glebius@FreeBSD.org>	soreceive_dgram: assert that a datagram has control or data
# d62c4607	18-Mar-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: remove unused KPIs to manipulate sockets These KPIs were added in dd0e6c383a9f0 and through 15 years had zero use. They slightly remind what IfAPI does for struct ifnet. But IfAPI does that for the sake of large collection of NIC drivers not being aware of struct ifnet. For the sockets it is unclear what could be a large collection of externally written kernel modules that need extensively use sockets and not be aware of their internals at the same time. This isolation of a structure knowledge requires a lot of work, and just throwing in a few KPIs isn't helpful. Reviewed by: kib, olce, markj Differential Revision: https://reviews.freebsd.org/D44311
# 7ee47c3b	28-Feb-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: in solisten_proto() don't call sbdestroy() on a PR_SOCKBUF A socket marked with PR_SOCKBUF has protocol specific socket buffers and will take care of the in its pr_listen method. Right now we don't have any sockets that can listen and are PR_SOCKBUF, but that will change soon.
# ce69e373	03-Feb-2024	Gleb Smirnoff <glebius@FreeBSD.org>	Revert "sockets: retire sorflush()" Provide a comment in sorflush() why the socket I/O sx(9) lock is actually important. This reverts commit 507f87a799cf0811ce30f0ae7f10ba19b2fd3db3.
# f79a8585	30-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: garbage collect SS_ISCONFIRMING Fixes: 8df32b19dee92b5eaa4b488ae78dca6accfcb38e
# 507f87a7	16-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: retire sorflush() With removal of dom_dispose method the function boils down to two meaningful function calls: socantrcvmore() and sbrelease(). The latter is only relevant for protocols that use generic socket buffers. The socket I/O sx(9) lock acquisition in sorflush() is not relevant for shutdown(2) operation as it doesn't do any I/O that may interleave with read(2) or write(2). The socket buffer mutex acquisition inside sbrelease() is what guarantees thread safety. This sx(9) acquisition in soshutdown() can be tracked down to 4.4BSD times, where it used to be sblock(), and it was carried over through the years evolving together with sockets with no reconsideration of why do we carry it over. I can't tell if that sblock() made sense back then, but it doesn't make any today. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43415
# 289bee16	16-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: remove dom_dispose and PR_RIGHTS Passing file descriptors (rights) via sockets is a feature specific to PF_UNIX only, so fully isolate the logic into uipc_usrreq.c. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43414
# 5bba2728	16-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: make pr_shutdown fully protocol specific method Disassemble a one-for-all soshutdown() into protocol specific methods. This creates a small amount of copy & paste, but makes code a lot more self documented, as protocol specific method would execute only the code that is relevant to that protocol and nothing else. This also fixes a couple recent regressions and reduces risk of future regressions. The extended KPI for the new pr_shutdown removes need for the extra pr_flush which was added for the sake of SCTP which could not perform its shutdown properly with the old one. Particularly for SCTP this change streamlines a lot of code. Some notes on why certain parts of code were copied or were not to certain protocols: * The (SS_ISCONNECTED \| SS_ISCONNECTING \| SS_ISDISCONNECTING) check is needed only for those protocols that may be connected or disconnected. * The above reduces into only SS_ISCONNECTED for those protocols that always connect instantly. * The ENOTCONN and continue processing hack is left only for datagram protocols. * The SOLISTENING(so) block is copied to those protocols that listen(2). * sorflush() on SHUT_RD is copied almost to every protocol, but that will be refactored later. * wakeup(&so->so_timeo) is copied to protocols that can make a non-instant connect(2), can SO_LINGER or can accept(2). There are three protocols (netgraph(4), Bluetooth, SDP) that did not have pr_shutdown, but old soshutdown() would still perform sorflush() on SHUT_RD for them and also wakeup(9). Those protocols partially supported shutdown(2) returning EOPNOTSUP for SHUT_WR/SHUT_RDWR, now they fully lost shutdown(2) support. I'm pretty sure netgraph(4) and Bluetooth are okay about that and SDP is almost abandoned anyway. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43413
# c3276e02	16-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: make shutdown(2) how argument a enum Reviwed by: tuexen Differential Revision: https://reviews.freebsd.org/D43412
# 59ce044a	08-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: on shutdown(2) do sorflush() only in case of generic sockbuf This is a quick plug to fix panic with Netlink which has protocol specific buffers. Note that PF_UNIX/SOCK_DGRAM, which also has its own buffers, avoids the panic due to being SOCK_DGRAM. A correct but more complicated fix that needs to be done is to merge pr_shutdown, pr_flush and dom_dispose into one protocol method that may call sorflush for generic sockets or do their own stuff for protocol which has own buffers. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43367 Reported-by: syzbot+a58e1615881c01a51653@syzkaller.appspotmail.com
# 0fac350c	30-Nov-2023	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: don't malloc/free sockaddr memory on getpeername/getsockname Just like it was done for accept(2) in cfb1e92912b4, use same approach for two simplier syscalls that return socket addresses. Although, these two syscalls aren't performance critical, this change generalizes some code between 3 syscalls trimming code size. Following example of accept(2), provide VNET-aware and INVARIANT-checking wrappers sopeeraddr() and sosockaddr() around protosw methods. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D42694
# cfb1e929	30-Nov-2023	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: don't malloc/free sockaddr memory on accept(2) Let the accept functions provide stack memory for protocols to fill it in. Generic code should provide sockaddr_storage, specialized code may provide smaller structure. While rewriting accept(2) make 'addrlen' a true in/out parameter, reporting required length in case if provided length was insufficient. Our manual page accept(2) and POSIX don't explicitly require that, but one can read the text as they do. Linux also does that. Update tests accordingly. Reviewed by: rscheff, tuexen, zlei, dchagin Differential Revision: https://reviews.freebsd.org/D42635
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# f64a688d	13-Nov-2023	Brooks Davis <brooks@FreeBSD.org>	Remove gratuitous copyouts of unchanged struct mac. The get operations change the data pointed to by the structure, but do not update the contents of the struct. Mark the struct mac arguments of mac_[gs]etsockopt_*label() and mac_check_structmac_consistent() const to prevent this from changing in the future. Reviewed by: markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14488
# 978be1ee	09-Oct-2023	Zhenlei Huang <zlei@FreeBSD.org>	sockets: Add sysctl flag CTLFLAG_TUN to loader tunable The sysctl variable 'kern.ipc.maxsockets' is actually a loader tunable. Add sysctl flag CTLFLAG_TUN to it so that `sysctl -T` will report it correctly. No functional change intended. Reviewed by: kib, imp MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D42113
# f4410241	09-Sep-2023	Greg Becker <becker.greg@att.net>	sockets: re-check socket state after call to pr_rcvd() Socket state may have changed after dropping the receive buffer lock in order to call pr_rcvd(). If the buffer is empty, re-check the state after reaquiring the lock and skip calling sbwait() if the socket is in error or the peer has closed. PR: 212716 Reviewed by: markj, glebius MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D41783
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s]__FBSDID$"\$FreeBSD\$"$;?\s*\n/
# d29b95ec	14-Aug-2023	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: on accept(2) don't copy all of so_options to new socket As uncovered by e3ba0d6adde3 we are copying lots of irrelevant options from the listener to an accepted socket, even those that aren't relevant to a non-listener, e.g. SO_REUSE*, SO_ACCEPTFILTER. Stop doing that and provide a fixed opt-in list for options to be inherited. Ideally we shall not inherit anything at all. For compatibility inherit a set of options that are meaningful for a non-listening socket of a protocol that can listen(2). Differential Revision: https://reviews.freebsd.org/D41412 Fixes: e3ba0d6adde3c694f46a30b3b67eba43a7099395
# 4824d788	30-Apr-2023	Eugene Grosbein <eugen@FreeBSD.org>	listen(2): improve administrator control over logging As documented in listen.2 manual page, the kernel emits a LOG_DEBUG syslog message if a socket listen queue overflows. For some appliances, it may be desirable to change the priority to some higher value like LOG_INFO while keeping other debugging suppressed. OTOH there are cases when such overflows are normal and expected. Then it may be desirable to suppress overflow logging altogether, so that dmesg buffer is not flooded over long run. In addition to existing sysctl kern.ipc.sooverinterval, introduce new sysctl kern.ipc.sooverprio that defaults to 7 (LOG_DEBUG) to preserve current behavior. It may be changed to any value in a range of 0..7 for corresponding priority or to -1 to suppress logging. Document it in the listen.2 manual page. MFC after: 1 month
# b4b33821	21-Mar-2023	Mark Johnston <markj@FreeBSD.org>	ktls: Fix interlocking between ktls_enable_rx() and listen(2) The TCP_TXTLS_ENABLE and TCP_RXTLS_ENABLE socket option handlers check whether the socket is listening socket and fail if so, but this check is racy. Since we have to lock the socket buffer later anyway, defer the check to that point. ktls_enable_tx() locks the send buffer's I/O lock, which will fail if the socket is a listening socket, so no explicit checks are needed. In ktls_enable_rx(), which does not acquire the I/O lock (see the review for some discussion on this), use an explicit SOLISTENING() check after locking the recv socket buffer. Otherwise, a concurrent solisten_proto() call can trigger crashes and memory leaks by wiping out socket buffers as ktls_enable_*() is modifying them. Also make sure that a KTLS-enabled socket can't be converted to a listening socket, and use SOCK_(SEND\|RECV)BUF_LOCK macros instead of the old ones while here. Add some simple regression tests involving listen(2). Reported by: syzkaller MFC after: 2 weeks Reviewed by: gallatin, glebius, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D38504
# 636b19ea	14-Feb-2023	Mark Johnston <markj@FreeBSD.org>	tcp: Disallow re-connection of a connected socket soconnectat() tries to ensure that one cannot connect a connected socket. However, the check is racy and does not really prevent two threads from attempting to connect the same TCP socket. Modify tcp_connect() and tcp6_connect() to perform the check again, this time synchronized by the inpcb lock, under which we call soisconnecting(). Reported by: syzkaller Reviewed by: glebius MFC after: 2 weeks Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38507
# a0102dee	01-Feb-2023	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: in sousrsend() pass down the error to aio(4) This somewhat undermines the initial goal of sousrsend() to have all the special error handling for a write on a socket in a single place. The aio(4) needs to see EWOULDBLOCK to re-schedule the job. Because aio(4) handles return from soreceive() and sousrsend() with the same code, we can't check for (error == 0 && done < job_nbytes). Keeping this exclusion for aio(4) seems a lesser evil. Fixes: 7a2c93b86ef75390a60a4b4d6e3911b36221dfbe
# 7a2c93b8	14-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: provide sousrsend() that does socket specific error handling Sockets have special handling for EPIPE on a write, that was spread out into several places. Treating transient errors is also special - if protocol is atomic, than we should ignore any changes to uio_resid, a transient error means the write had completely failed (see d2b3a0ed31e). - Provide sousrsend() that expects a valid uio, and leave sosend() for kernel consumers only. Do all special error handling right here. - In dofilewrite() don't do special handling of error for DTYPE_SOCKET. - For send(2), write(2) and aio_write(2) call into sousrsend() and remove error handling for kern_sendit(), soo_write() and soaio_process_job(). PR: 265087 Reported by: rz-rpi03 at h-ka.de Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35863
# ebdf27b6	10-Dec-2022	Mateusz Guzik <mjg@FreeBSD.org>	uipc: remove accept_mtx It is unused since 779f106aa169256b ("Listening sockets improvements.") Sponsored by: Rubicon Communications, LLC ("Netgate")
# bc0d4076	11-Oct-2022	Michael Tuexen <tuexen@FreeBSD.org>	Revert "listen(): improve POSIX compliance" This reverts commit 76e6e4d72f8d3da7d19242f303bc95461fde7fb9. Several programs in the tree use -1 instead of INT_MAX to use the maximum value. Thanks to Eugene Grosbein for pointing this out.
# 76e6e4d7	11-Oct-2022	Michael Tuexen <tuexen@FreeBSD.org>	listen(): improve POSIX compliance Ensure that a negative backlog argument is handled as it if was 0. Reviewed by: markj@, glebius@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D31821
# f6696856	27-Sep-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	protocols: make socket buffers ioctl handler changeable Allow to set custom per-protocol handlers for the socket buffers ioctls by introducing pr_setsbopt callback with the default value set to the currently-used sbsetopt(). Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D36746
# e80062a2	08-Sep-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: avoid call to soisconnected() on transition to ESTABLISHED This call existed since pre-FreeBSD times, and it is hard to understand why it was there in the first place. After 6f3caa6d815 it definitely became necessary always and commit message from f1ee30ccd60 confirms that. Now that 6f3caa6d815 is effectively backed out by 07285bb4c22, the call appears to be useful only for sockets that landed on the incomplete queue, e.g. sockets that have accept_filter(9) enabled on them. Provide a new TCP flag to mark connections that are known to be on the incomplete queue, and call soisconnected() only for those connections. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D36488
# 24af7808	30-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protosw: repair protocol selection logic in socket(2) Pointy hat to: glebius Fixes: 61f7427f02a307d28af674a12c45dd546e3898e4
# 61f7427f	30-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protosw: cleanup protocols that existed merely to provide pr_input Since 4.4BSD the protosw was used to implement socket types created by socket(2) syscall and at the same to demultiplex incoming IPv4 datagrams (later copied to IPv6). This story ended with 78b1fc05b20. These entries (e.g. IPPROTO_ICMP) in inetsw that were added to catch packets in ip_input(), they would also be returned by pffindproto() if user says socket(AF_INET, SOCK_RAW, IPPROTO_ICMP). Thus, for raw sockets to work correctly, all the entries were pointing at raw_usrreq differentiating only in the value of pr_protocol. With 78b1fc05b20 all these entries are no longer needed, as ip_protox is independent of protosw. Any socket syscall requesting SOCK_RAW type would end up with rip_protosw. And this protosw has its pr_protocol set to 0, allowing to mark socket with any protocol. For IPv6 raw socket the change required two small fixes: o Validate user provided protocol value o Always use protocol number stored in inp in rip6_attach, instead of protosw value, which is now always 0. Differential revision: https://reviews.freebsd.org/D36380
# 8624f434	30-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	divert: declare PF_DIVERT domain and stop abusing PF_INET The divert(4) is not a protocol of IPv4. It is a socket to intercept packets from ipfw(4) to userland and re-inject them back. It can divert and re-inject IPv4 and IPv6 packets today, but potentially it is not limited to these two protocols. The IPPROTO_DIVERT does not belong to known IP protocols, it doesn't even fit into u_char. I guess, the implementation of divert(4) was done the way it is done basically because it was easier to do it this way, back when protocols for sockets were intertwined with IP protocols and domains were statically compiled in. Moving divert(4) out of inetsw accomplished two important things: 1) IPDIVERT is getting much closer to be not dependent on INET. This will be finalized in following changes. 2) Now divert socket no longer aliases with raw IPv4 socket. Domain/proto selection code won't need a hack for SOCK_RAW and multiple entries in inetsw implementing different flavors of raw socket can merge into one without requirement of raw IPv4 being the last member of dom_protosw. Differential revision: https://reviews.freebsd.org/D36379
# e7d02be1	17-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protosw: refactor protosw and domain static declaration and load o Assert that every protosw has pr_attach. Now this structure is only for socket protocols declarations and nothing else. o Merge struct pr_usrreqs into struct protosw. This was suggested in 1996 by wollman@ (see 7b187005d18ef), and later reiterated in 2006 by rwatson@ (see 6fbb9cf860dcd). o Make struct domain hold a variable sized array of protosw pointers. For most protocols these pointers are initialized statically. Those domains that may have loadable protocols have spacers. IPv4 and IPv6 have 8 spacers each (andre@ dff3237ee54ea). o For inetsw and inet6sw leave a comment noting that many protosw entries very likely are dead code. o Refactor pf_proto_[un]register() into protosw_[un]register(). o Isolate pr_*_notsupp() methods into uipc_domain.c Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36232
# f277746e	12-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protosw: change prototype for pr_control For some reason protosw.h is used during world complation and userland is not aware of caddr_t, a relic from the first version of C. Broken buildworld is good reason to get rid of yet another caddr_t in kernel. Fixes: 886fc1e80490fb03e72e306774766cbb2c733ac6
# 07285bb4	10-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	tcp: utilize new solisten_clone() and solisten_enqueue() This streamlines cloning of a socket from a listener. Now we do not drop the inpcb lock during creation of a new socket, do not do useless state transitions, and put a fully initialized socket+inpcb+tcpcb into the listen queue. Before this change, first we would allocate the socket and inpcb+tcpcb via tcp_usr_attach() as TCPS_CLOSED, link them into global list of pcbs, unlock pcb and put this onto incomplete queue (see 6f3caa6d815). Then, after sonewconn() we would lock it again, transition into TCPS_SYN_RECEIVED, insert into inpcb hash, finalize initialization of tcpcb. And then, in call into tcp_do_segment() and upon transition to TCPS_ESTABLISHED call soisconnected(). This call would lock the listening socket once again with a LOR protection sequence and then we would relocate the socket onto the complete queue and only now it is ready for accept(2). Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D36064
# 8f5a0a2e	10-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: provide solisten_clone(), solisten_enqueue() as alternative KPI to sonewconn(). The latter has three stages: - check the listening socket queue limits - allocate a new socket - call into protocol attach method - link the new socket into the listen queue of the listening socket The attach method, originally designed for a creation of socket by the socket(2) syscall has slightly different semantics than attach of a socket cloned by listener. Make it possible for protocols to call into the first stage, then perform a different attach, and then call into the final stage. The first stage, that checks limits and clones a socket is called solisten_clone(), and the function that enqueues the socket is solisten_enqueue(). Reviewed by: tuexen Differential revision: https://reviews.freebsd.org/D36063
# be1f485d	25-Jul-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	sockets: add MSG_TRUNC flag handling for recvfrom()/recvmsg(). Implement Linux-variant of MSG_TRUNC input flag used in recv(), recvfrom() and recvmsg(). Posix defines MSG_TRUNC as an output flag, indicating packet/datagram truncation. Linux extended it a while (~15+ years) ago to act as input flag, resulting in returning the full packet size regarless of the input buffer size. It's a (relatively) popular pattern to do recvmsg( MSG_PEEK \| MSG_TRUNC) to get the packet size, allocate the buffer and issue another call to fetch the packet. In particular, it's popular in userland netlink code, which is the primary driving factor of this change. This commit implements the MSG_TRUNC support for SOCK_DGRAM sockets (udp, unix and all soreceive_generic() users). PR: kern/176322 Reviewed by: pauamma(doc) Differential Revision: https://reviews.freebsd.org/D35909 MFC after: 1 month
# c261510e	08-Jul-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: fix setsockopt(SO_RCVTIMEO) on a listening socket MFC after: 3 weeks
# d8596171	04-Jul-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: use only soref()/sorele() as socket reference count o Retire SS_FDREF as it is basically a debug flag on top of already existing soref()/sorele(). o Convert SS_PROTOREF into soref()/sorele(). o Change reference model for the listen queues, see below. o Make sofree() private. The correct KPI to use is only sorele(). o Make soabort() respect the model and sorele() instead of sofree(). Note on listening queues. Until now the sockets on a queue had zero reference count. And the reference were given only upon accept(2). The assumption was that there is no way to see the queued socket from anywhere except its head. This is not true, since queued sockets already have pcbs, which are linked at least into the global pcb lists. With this change we put the reference right in the sonewconn() and on accept(2) path we just hand the existing reference to the file descriptor. Differential revision: https://reviews.freebsd.org/D35679
# bc760564	04-Jul-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: use positive flag for file descriptor socket reference Rename SS_NOFDREF to SS_FDREF and flip all bitwise operations. Mark sockets created by socreate() with SS_FDREF. This change is mostly illustrative. With it we see that SS_FDREF is a debugging flag, since: * socreate() takes a reference with soref(). * on accept path solisten_dequeue() takes a reference with soref() and then soaccept() sets SS_FDREF. * soclose() checks SS_FDREF, removes it and does sorele(). Reviewed by: tuexen Differential revision: https://reviews.freebsd.org/D35678
# 66c8e3fc	30-Jun-2022	Gleb Smirnoff <glebius@FreeBSD.org>	socket: fix listen(2) on an already listening socket Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35669 Fixes: 141fe2dceeaeefaaffc2242c8652345a081e825a
# a4fc4142	24-Jun-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: enable protocol specific socket buffers Split struct sockbuf into common shared fields and protocol specific union, where protocols are free to implement whatever buffer they want. Such protocols should mark themselves with PR_SOCKBUF and are expected to initialize their buffers in their pr_attach and tear them down in pr_detach. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35299
# f6379f7f	16-Jun-2022	Mark Johnston <markj@FreeBSD.org>	socket: Fix a race between kevent(2) and listen(2) When locking the knote list for a socket, we check whether the socket is a listening socket in order to select the appropriate mutex; a listening socket uses the socket lock, while data sockets use socket buffer mutexes. If SOLISTENING(so) is false and the knote lock routine locks a socket buffer, then it must re-check whether the socket is a listening socket since solisten_proto() could have changed the socket's identity while we were blocked on the socket buffer lock. Reported by: syzkaller Reviewed by: glebius MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35492
# a8e286bb	03-Jun-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: use socket buffer mutexes in struct socket directly Convert more generic socket code to not use sockbuf compat pointer. Continuation of 4328318445a.
# 37351133	14-May-2022	Rick Macklem <rmacklem@FreeBSD.org>	uipc_socket.c: Modify MSG_TLSAPPDATA to only do Alert Records Without this patch, the MSG_TLSAPPDATA flag would cause soreceive_generic() to return ENXIO for any non-application data record in a TLS receive stream. This works ok for TLS1.2, since Alert records appear to be the only non-application data records received. However, for TLS1.3, there can be post-handshake handshake records, such as NewSessionKey sent to the client from the server. These handshake records cannot be handled by the upcall which does an SSL_read() with length == 0. It appears that the client can simply throw away these NewSessionKey records, but to do so, it needs to receive them within the kernel. This patch modifies the semantics of MSG_TLSAPPDATA slightly, so that it only applies to Alert records and not Handshake records. It is needed to allow the krpc to work with KTLS1.3. Reviewed by: hselasky MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D35170
# 43283184	12-May-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: use socket buffer mutexes in struct socket directly Since c67f3b8b78e the sockbuf mutexes belong to the containing socket, and socket buffers just point to it. In 74a68313b50 macros that access this mutex directly were added. Go over the core socket code and eliminate code that reaches the mutex by dereferencing the sockbuf compatibility pointer. This change requires a KPI change, as some functions were given the sockbuf pointer only without any hint if it is a receive or send buffer. This change doesn't cover the whole kernel, many protocols still use compatibility pointers internally. However, it allows operation of a protocol that doesn't use them. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35152
# 2e4e5ee2	12-May-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: delete stale comment from sofree() First paragraph refers to old past "we used to" and is no longer important today. Second paragraph has just a wrong statement that socket buffer is destroyed before pru_detach.
# a982ce04	09-May-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: remove the socket-on-stack hack from sorflush() The hack can be tracked down to 4.4BSD, where copy was performed under splimp() and then after splx() dom_dispose was called. Stevens has a chapter on this function, but he doesn't answer why this trick is necessary. Why can't we call into dom_dispose under splimp()? Anyway, with multithreaded kernel the hack seems to be necessary to avoid LORs between socket buffer lock and different filesystem locks, especially network file systems. The new socket buffers KPI sbcut() from 1d2df300e9b allow us to get rid of the hack. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35125
# 42f2fa99	09-May-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: don't call dom_dispose() on a listening socket sorflush() already did the right thing, so only sofree() needed a fix. Turn check into assertion in our only dom_dispose method. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35124
# c17418a0	09-May-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: assert that any protocol with PR_RIGHTS has dom_dispose() Through the entire history only PF_UNIX has this feature. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35123
# 97f8198e	09-May-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: make SO_SND/SO_RCV a enum Not a functional change now. The enum will also be used for other socket buffer related KPIs.
# aeb91e95	26-Mar-2022	Alexander Leidinger <netchild@FreeBSD.org>	Log euid, rgid and jail on listen queue overflow If you have numerous jails with multiple similar services running, this helps to narrow down which services this log is referring to.
# 5de79eed	07-Feb-2022	Mark Johnston <markj@FreeBSD.org>	ktls: Disallow transmitting empty frames outside of TLS 1.0/CBC mode There was nothing preventing one from sending an empty fragment on an arbitrary KTLS TX-enabled socket, but ktls_frame() asserts that this could not happen. Though the transmit path handles this case for TLS 1.0 with AES-CBC, we should be strict and allow empty fragments only in modes where it is explicitly allowed. Modify sosend_generic() to reject writes to a KTLS-enabled socket if the number of data bytes is zero, so that userspace cannot trigger the aforementioned assertion. Add regression tests to exercise this case. Reported by: syzkaller Reviewed by: gallatin, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34195
# fe27f1db	25-Dec-2021	Alexander Motin <mav@FreeBSD.org>	kern: Remove CTLFLAG_NEEDGIANT from some sysctls. MFC after: 2 weeks
# f9978339	20-Dec-2021	Hans Petter Selasky <hselasky@FreeBSD.org>	Remove dead code. The variable orig_resid is always set to zero right after the while loop where it is cleared. Reviewed by: gallatin@ and glebius@ Differential Revision: https://reviews.freebsd.org/D33589 MFC after: 1 week Sponsored by: NVIDIA Networking
# e3ba94d4	09-Nov-2021	John Baldwin <jhb@FreeBSD.org>	Don't require the socket lock for sorele(). Previously, sorele() always required the socket lock and dropped the lock if the released reference was not the last reference. Many callers locked the socket lock just before calling sorele() resulting in a wasted lock/unlock when not dropping the last reference. Move the previous implementation of sorele() into a new sorele_locked() function and use it instead of sorele() for various places in uipc_socket.c that called sorele() while already holding the socket lock. The sorele() macro now uses refcount_release_if_not_last() try to drop the socket reference without locking the socket. If that shortcut fails, it locks the socket and calls sorele_locked(). Reviewed by: kib, markj Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D32741
# c441592a	03-Nov-2021	Allan Jude <allanjude@FreeBSD.org>	Allow kern.ipc.maxsockets to be set to current value without error Normally setting kern.ipc.maxsockets returns EINVAL if the new value is not greater than the previous value. This can cause spurious error messages when sysctl.conf is processed multiple times, or when automation systems try to ensure the sysctl is set to the correct value. If the value is unchanged, then just do nothing. PR: 243532 Reviewed by: markj MFC after: 3 days Sponsored by: Modirum MDPay Sponsored by: Klara Inc. Differential Revision: https://reviews.freebsd.org/D32775
# a37e4fd1	01-Oct-2021	Gleb Smirnoff <glebius@FreeBSD.org>	Re-style dfcef8771484 to keep the code and variables related to listening sockets separated from code for generic sockets. No objection: markj
# ade1daa5	16-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Synchronize soshutdown() with listen(2) and AIO To handle shutdown(SHUT_RD) we flush the receive buffer of the socket. This may involve searching for control messages of type SCM_RIGHTS, since we need to close the file references. Closing arbitrary files with socket buffer locks held is undesirable, mainly due to lock ordering issues, so we instead make a copy of the socket buffer and operate on that without any locks. Fields in the original buffer are cleared. This behaviour clobbered the AIO job queue associated with a receive buffer. It could also cause us to leak a KTLS session reference. Reorder socket buffer fields to address this. An alternate solution would be to remove the hack in sorflush(), but this is not quite feasible (yet). In particular, though sorflush() flags the sockbuf with SBS_CANTRCVMORE, it is possible for more data to be queued - the flag just prevents userspace from reading more data. I suspect we should fix this; SBS_CANTRCVMORE represents a terminal state and protocols can likely just drop any data destined for such a buffer. Many of them already do, but in some cases the check is racy, and some KPI churn will be needed to fix everything. This approach is more straightforward for now. Reported by: syzbot+104d8ee3430361cb2795@syzkaller.appspotmail.com Reported by: syzbot+5bd2e7d05f84a59d0d1b@syzkaller.appspotmail.com Reviewed by: jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31976
# 883761f0	16-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Remove NOFREE from the socket zone This flag was added during the transition away from the legacy zone allocator, commit c897b81311792ccf6a93feff2a405e2ae53f664e. The old zone allocator effectively provided _NOFREE semantics, but it seems that they are not required for sockets. In particular, we use reference counting to keep sockets live. One somewhat dangerous case is sonewconn(), which returns a pointer to a socket with reference count 0. This socket is still effectively owned by the listening socket. Protocols must therefore be careful to synchronize sonewconn() calls with their pru_close implementations, since for listening sockets soclose() will abort the child sockets. For example, TCP holds the listening socket's PCB read locked across the sonewconn() call, which blocks tcp_usr_close(), and sofree() synchronizes with a concurrent soabort() of the nascent socket. However, _NOFREE semantics are not required here. Eliminating _NOFREE has several benefits: it enables use-after-free detection (e.g., by KASAN) and lets the system reclaim memory from the socket zone under memory pressure. No functional change intended. Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31975
# 6b288408	16-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Add assertions around naked refcount decrements Sockets in a listen queue hold a reference to the parent listening socket. Several code paths release this reference manually when moving a child socket out of the queue. Replace comments about the expected post-decrement refcount value with assertions. Use refcount_load() instead of a plain load. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31974
# dfcef877	16-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Fix a use-after-free in soclose() After releasing the fd reference to a socket "so", we should avoid testing SOLISTENING(so) since the socket may have been freed. Instead, directly test whether the list of unaccepted sockets is empty. Fixes: f4bb1869ddd2 ("Consistently use the SOLISTENING() macro") Pointy hat: markj MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31973
# fa0463c3	14-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: De-duplicate SBLOCKWAIT() definitions MFC after: 1 week Sponsored by: The FreeBSD Foundation
# 141fe2dc	10-Sep-2021	Mark Johnston <markj@FreeBSD.org>	aio: Interlock with listen(2) soo_aio_queue() did not handle the possibility that the provided socket is a listening socket. Up until recently, to fix this one would have to acquire the socket lock first and check, since the socket buffer locks were destroyed by listen(2). Now that the socket buffer locks belong to the socket, simply check SOLISTENING(so) after acquiring them, and make listen(2) return an error if any AIO jobs are enqueued on the socket. Add a couple of simple regression test cases. Note that this fixes things only for the default AIO implementation; cxgbe(4)'s TCP offload has a separate pru_aio_queue implementation which requires its own solution. Reported by: syzbot+c8aa122fa2c6a4e2a28b@syzkaller.appspotmail.com Reported by: syzbot+39af117d43d4f0faf512@syzkaller.appspotmail.com Reported by: syzbot+60cceb9569145a0b993b@syzkaller.appspotmail.com Reported by: syzbot+2d522c5db87710277ca5@syzkaller.appspotmail.com Reviewed by: tuexen, gallatin, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31901
# 523d58aa	07-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Remove unneeded SOLISTENING checks Now that SOCK_IO_*_LOCK() checks for listening sockets, we can eliminate some racy SOLISTENING() checks. No functional change intended. Reviewed by: tuexen MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31660
# bd4a39cc	07-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Properly interlock when transitioning to a listening socket Currently, most protocols implement pru_listen with something like the following: SOCK_LOCK(so); error = solisten_proto_check(so); if (error) { SOCK_UNLOCK(so); return (error); } solisten_proto(so); SOCK_UNLOCK(so); solisten_proto_check() fails if the socket is connected or connecting. However, the socket lock is not used during I/O, so this pattern is racy. The change modifies solisten_proto_check() to additionally acquire socket buffer locks, and the calling thread holds them until solisten_proto() or solisten_proto_abort() is called. Now that the socket buffer locks are preserved across a listen(2), this change allows socket I/O paths to properly interlock with listen(2). This fixes a large number of syzbot reports, only one is listed below and the rest will be dup'ed to it. Reported by: syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31659
# f94acf52	07-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Rename sb(un)lock() and interlock with listen(2) In preparation for moving sockbuf locks into the containing socket, provide alternative macros for the sockbuf I/O locks: SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a socket rather than a socket buffer. Note that these locks are used only to prevent concurrent readers and writters from interleaving I/O. When locking for I/O, return an error if the socket is a listening socket. Currently the check is racy since the sockbuf sx locks are destroyed during the transition to a listening socket, but that will no longer be true after some follow-up changes. Modify a few places to check for errors from sblock()/SOCK_IO_(SEND\|RECV)_LOCK() where they were not before. In particular, add checks to sendfile() and sorflush(). Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31657
# 7045b160	28-Jul-2021	Roy Marples <roy@marples.name>	socket: Implement SO_RERROR SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we loose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports. Reviewed by: philip (network), kbowling (transport), gbe (manpages) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26652
# a1002174	14-Jun-2021	Mark Johnston <markj@FreeBSD.org>	Consistently use the SOCKBUF_MTX() and SOCK_MTX() macros This makes it easier to change the socket locking protocols. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation
# f4bb1869	14-Jun-2021	Mark Johnston <markj@FreeBSD.org>	Consistently use the SOLISTENING() macro Some code was using it already, but in many places we were testing SO_ACCEPTCONN directly. As a small step towards fixing some bugs involving synchronization with listen(2), make the kernel consistently use SOLISTENING(). No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation
# aa341db3	25-May-2021	John Baldwin <jhb@FreeBSD.org>	Rename m_unmappedtouio() to m_unmapped_uiomove(). This function doesn't only copy data into a uio but instead is a variant of uiomove() similar to uiomove_fromphys(). Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30444
# 916c61a5	21-May-2021	Mark Johnston <markj@FreeBSD.org>	Fix handling of errors from pru_send(PRUS_NOTREADY) PRUS_NOTREADY indicates that the caller has not yet populated the chain with data, and so it is not ready for transmission. This is used by sendfile (for async I/O) and KTLS (for encryption). In particular, if pru_send returns an error, the caller is responsible for freeing the chain since other implicit references to the data buffers exist. For async sendfile, it happens that an error will only be returned if the connection was dropped, in which case tcp_usr_ready() will handle freeing the chain. But since KTLS can be used in conjunction with the regular socket I/O system calls, many more error cases - which do not result in the connection being dropped - are reachable. In these cases, KTLS was effectively assuming success. So: - Change sosend_generic() to free the mbuf chain if pru_send(PRUS_NOTREADY) fails. Nothing else owns a reference to the chain at that point. - Similarly, in vn_sendfile() change the !async I/O && KTLS case to free the chain. - If async I/O is still outstanding when pru_send fails in vn_sendfile(), set an error in the sfio structure so that the connection is aborted and the mbuf chain is freed. Reviewed by: gallatin, tuexen Discussed with: jhb MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30349
# b295c5dd	18-May-2021	Lv Yunlong <lylgood@foxmail.com>	socket: Release cred reference later in sodealloc() We dereference so->so_cred to update the per-uid socket buffer accounting, so the crfree() call must be deferred until after that point. PR: 255869 MFC after: 1 week
# d8acd268	12-May-2021	Mark Johnston <markj@FreeBSD.org>	Fix mbuf leaks in various pru_send implementations The various protocol implementations are not very consistent about freeing mbufs in error paths. In general, all protocols must free both "m" and "control" upon an error, except if PRUS_NOTREADY is specified (this is only implemented by TCP and unix(4) and requires further work not handled in this diff), in which case "control" still must be freed. This diff plugs various leaks in the pru_send implementations. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30151
# 3aaaa2ef	28-Apr-2021	Thomas Munro <tmunro@FreeBSD.org>	poll(2): Add POLLRDHUP. Teach poll(2) to support Linux-style POLLRDHUP events for sockets, if requested. Triggered when the remote peer shuts down writing or closes its end. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D29757
# 27457983	07-Apr-2021	Mark Johnston <markj@FreeBSD.org>	capsicum: Limit socket operations in capability mode Capsicum did not prevent certain privileged networking operations, specifically creation of raw sockets and network configuration ioctls. However, these facilities can be used to circumvent some of the restrictions that capability mode is supposed to enforce. Add capability mode checks to disallow network configuration ioctls and creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET internet sockets. Reviewed by: oshogbo Discussed with: emaste Reported by: manu Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D29423
# f187d6df	15-Mar-2021	Kyle Evans <kevans@FreeBSD.org>	base: remove if_wg(4) and associated utilities, manpage After length decisions, we've decided that the if_wg(4) driver and related work is not yet ready to live in the tree. This driver has larger security implications than many, and thus will be held to more scrutiny than other drivers. Please also see the related message sent to the freebsd-hackers@ and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on 2021/03/16, with the subject line "Removing WireGuard Support From Base" for additional context.
# 74ae3f3e	14-Mar-2021	Kyle Evans <kevans@FreeBSD.org>	if_wg: import latest fixup work from the wireguard-freebsd project This is the culmination of about a week of work from three developers to fix a number of functional and security issues. This patch consists of work done by the following folks: - Jason A. Donenfeld <Jason@zx2c4.com> - Matt Dunwoodie <ncon@noconroy.net> - Kyle Evans <kevans@FreeBSD.org> Notable changes include: - Packets are now correctly staged for processing once the handshake has completed, resulting in less packet loss in the interim. - Various race conditions have been resolved, particularly w.r.t. socket and packet lifetime (panics) - Various tests have been added to assure correct functionality and tooling conformance - Many security issues have been addressed - if_wg now maintains jail-friendly semantics: sockets are created in the interface's home vnet so that it can act as the sole network connection for a jail - if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0 - if_wg now exports via ioctl a format that is future proof and complete. It is additionally supported by the upstream wireguard-tools (which we plan to merge in to base soon) - if_wg now conforms to the WireGuard protocol and is more closely aligned with security auditing guidelines Note that the driver has been rebased away from using iflib. iflib poses a number of challenges for a cloned device trying to operate in a vnet that are non-trivial to solve and adds complexity to the implementation for little gain. The crypto implementation that was previously added to the tree was a super complex integration of what previously appeared in an old out of tree Linux module, which has been reduced to crypto.c containing simple boring reference implementations. This is part of a near-to-mid term goal to work with FreeBSD kernel crypto folks and take advantage of or improve accelerated crypto already offered elsewhere. There's additional test suite effort underway out-of-tree taking advantage of the aforementioned jail-friendly semantics to test a number of real-world topologies, based on netns.sh. Also note that this is still a work in progress; work going further will be much smaller in nature. MFC after: 1 month (maybe)
# 504ebd61	20-Jan-2021	Kyle Evans <kevans@FreeBSD.org>	kern: sonewconn: set so_options before pru_attach() Protocol attachment has historically been able to observe and modify so->so_options as needed, and it still can for newly created sockets. 779f106aa169 moved this to after pru_attach() when we re-acquire the lock on the listening socket. Restore the historical behavior so that pru_attach implementations can consistently use it. Note that some pru_attach() do currently rely on this, though that may change in the future. D28265 contains a change to remove the use in TCP and IB/SDP bits, as resetting the requested linger time on incoming connections seems questionable at best. This does move the assignment out from under the head's listen lock, but glebius notes that head won't be going away and applications cannot assume any specific ordering with a race between a connection coming in and the application changing socket options anyways. Discussed-with: glebius MFC-after: 1 week
# 924d1c9a	08-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors." Wrong version of the change was pushed inadvertenly. This reverts commit 4a01b854ca5c2e5124958363b3326708b913af71.
# 4a01b854	07-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we loose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports.
# 34af05ea	03-Dec-2020	Kyle Evans <kevans@FreeBSD.org>	kern: soclose: don't sleep on SO_LINGER w/ timeout=0 This is a valid scenario that's handled in the various protocol layers where it makes sense (e.g., tcp_disconnect and sctp_disconnect). Given that it indicates we should immediately drop the connection, it makes little sense to sleep on it. This could lead to panics with INVARIANTS. On non-INVARIANTS kernels, this could result in the thread hanging until a signal interrupts it if the protocol does not mark the socket as disconnected for whatever reason. Reported by: syzbot+e625d92c1dd74e402c81@syzkaller.appspotmail.com Reviewed by: glebius, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27407
# e90afaa0	08-Nov-2020	Mateusz Guzik <mjg@FreeBSD.org>	kqueue: save space by using only one func pointer for assertions
# 6fed89b1	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	kern: clean up empty lines in .c and .h files
# 102829aa	19-Aug-2020	Rick Macklem <rmacklem@FreeBSD.org>	Add the MSG_TLSAPPDATA flag to indicate "return ENXIO" for non-application TLS data records. The kernel RPC cannot process non-application data records when using TLS. It must to an upcall to a userspace daemon that will call SSL_read() to process them. This patch adds a new flag called MSG_TLSAPPDATA that the kernel RPC can use to tell sorecieve() to return ENXIO instead of a non-application data record, when that is what is at the top of the receive queue. I put the code in #ifdef KERN_TLS/#endif, although it will build without that, so that it is recognized as only useful when KERN_TLS is enabled. The alternative to doing this is to have the kernel RPC re-queue the non-application data message after receiving it, but that seems more complicated and might introduce message ordering issues when there are multiple non-application data records one after another. I do not know what, if any, changes will be required to support TLS1.3. Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D25923
# 0f70a148	29-Jul-2020	John Baldwin <jhb@FreeBSD.org>	Properly handle a closed TLS socket with pending receive data. If the remote end closes a TLS socket and the socket buffer still contains not-yet-decrypted TLS records but no decrypted TLS records, soreceive needs to block or fail with EWOULDBLOCK. Previously it was trying to return data and dereferencing a NULL pointer. Reviewed by: np Sponsored by: Chelsio Differential Revision: https://reviews.freebsd.org/D25838
# 3c0e5685	23-Jul-2020	John Baldwin <jhb@FreeBSD.org>	Add support for KTLS RX via software decryption. Allow TLS records to be decrypted in the kernel after being received by a NIC. At a high level this is somewhat similar to software KTLS for the transmit path except in reverse. Protocols enqueue mbufs containing encrypted TLS records (or portions of records) into the tail of a socket buffer and the KTLS layer decrypts those records before returning them to userland applications. However, there is an important difference: - In the transmit case, the socket buffer is always a single "record" holding a chain of mbufs. Not-yet-encrypted mbufs are marked not ready (M_NOTREADY) and released to protocols for transmit by marking mbufs ready once their data is encrypted. - In the receive case, incoming (encrypted) data appended to the socket buffer is still a single stream of data from the protocol, but decrypted TLS records are stored as separate records in the socket buffer and read individually via recvmsg(). Initially I tried to make this work by marking incoming mbufs as M_NOTREADY, but there didn't seemed to be a non-gross way to deal with picking a portion of the mbuf chain and turning it into a new record in the socket buffer after decrypting the TLS record it contained (along with prepending a control message). Also, such mbufs would also need to be "pinned" in some way while they are being decrypted such that a concurrent sbcut() wouldn't free them out from under the thread performing decryption. As such, I settled on the following solution: - Socket buffers now contain an additional chain of mbufs (sb_mtls, sb_mtlstail, and sb_tlscc) containing encrypted mbufs appended by the protocol layer. These mbufs are still marked M_NOTREADY, but soreceive*() generally don't know about them (except that they will block waiting for data to be decrypted for a blocking read). - Each time a new mbuf is appended to this TLS mbuf chain, the socket buffer peeks at the TLS record header at the head of the chain to determine the encrypted record's length. If enough data is queued for the TLS record, the socket is placed on a per-CPU TLS workqueue (reusing the existing KTLS workqueues and worker threads). - The worker thread loops over the TLS mbuf chain decrypting records until it runs out of data. Each record is detached from the TLS mbuf chain while it is being decrypted to keep the mbufs "pinned". However, a new sb_dtlscc field tracks the character count of the detached record and sbcut()/sbdrop() is updated to account for the detached record. After the record is decrypted, the worker thread first checks to see if sbcut() dropped the record. If so, it is freed (can happen when a socket is closed with pending data). Otherwise, the header and trailer are stripped from the original mbufs, a control message is created holding the decrypted TLS header, and the decrypted TLS record is appended to the "normal" socket buffer chain. (Side note: the SBCHECK() infrastucture was very useful as I was able to add assertions there about the TLS chain that caught several bugs during development.) Tested by: rmacklem (various versions) Relnotes: yes Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D24628
# 95033af9	18-Jun-2020	Mark Johnston <markj@FreeBSD.org>	Add the SCTP_SUPPORT kernel option. This is in preparation for enabling a loadable SCTP stack. Analogous to IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured in order to support a loadable SCTP implementation. Discussed with: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
# 2684603c	28-May-2020	John Baldwin <jhb@FreeBSD.org>	Permit SO_NO_DDP and SO_NO_OFFLOAD to be read via getsockopt(2). MFC after: 2 weeks Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D24627
# 469f2e9e	27-May-2020	Rick Macklem <rmacklem@FreeBSD.org>	Fix sosend() for the case where mbufs are passed in while doing ktls. For kernel tls, sosend() needs to call ktls_frame() on the mbuf list to be sent. Without this patch, this was only done when sosend()'s arguments used a uio_iov and not when an mbuf list is passed in. At this time, sosend() is never called with an mbuf list argument when kernel tls is in use, but will be once nfs-over-tls has been incorporated into head. Reviewed by: gallatin, glebius Differential Revision: https://reviews.freebsd.org/D24674
# 0532a7a2	14-May-2020	Konstantin Belousov <kib@FreeBSD.org>	Fix r361037. Reorder flag manipulations and use barrier to ensure that the program order is followed by compiler and CPU, for unlocked reader of so_state. In collaboration with: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24842
# 39845728	14-May-2020	Konstantin Belousov <kib@FreeBSD.org>	Fix spurious ENOTCONN from closed unix domain socket other' side. Sometimes, when doing read(2) over unix domain socket, for which the other side socket was closed, read(2) returns -1/ENOTCONN instead of EOF AKA zero-size read. This is because soreceive_generic() does not lock socket when testing the so_state SS_ISCONNECTED\|SS_ISCONNECTING flags. It could end up that we do not observe so->so_rcv.sb_state bit SBS_CANTRCVMORE, and then miss SS_ flags. Change the test to check that the socket was never connected before returning ENOTCONN, by adding all state bits for connected. Reported and tested by: pho In collaboration with: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24819
# 6edfd179	02-May-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Step 4.1: mechanically rename M_NOMAP to M_EXTPG Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598
# 03066893	27-Apr-2020	Rick Macklem <rmacklem@FreeBSD.org>	Fix sosend_generic() so that it can handle a list of ext_pgs mbufs. Without this patch, sosend_generic() will try to use top->m_pkthdr.len, assuming that the first mbuf has a pkthdr. When a list of ext_pgs mbufs is passed in, the first mbuf is not a pkthdr and cannot be post-r359919. As such, the value of top->m_pkthdr.len is bogus (0 for my testing). This patch fixes sosend_generic() to handle this case, calculating the total length via m_length() for this case. There is currently nothing that hands a list of ext_pgs mbufs to sosend_generic(), but the nfs-over-tls kernel RPC code in projects/nfs-over-tls will do that and was used to test this patch. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24568
# f1f93475	27-Apr-2020	John Baldwin <jhb@FreeBSD.org>	Initial support for kernel offload of TLS receive. - Add a new TCP_RXTLS_ENABLE socket option to set the encryption and authentication algorithms and keys as well as the initial sequence number. - When reading from a socket using KTLS receive, applications must use recvmsg(). Each successful call to recvmsg() will return a single TLS record. A new TCP control message, TLS_GET_RECORD, will contain the TLS record header of the decrypted record. The regular message buffer passed to recvmsg() will receive the decrypted payload. This is similar to the interface used by Linux's KTLS RX except that Linux does not return the full TLS header in the control message. - Add plumbing to the TOE KTLS interface to request either transmit or receive KTLS sessions. - When a socket is using receive KTLS, redirect reads from soreceive_stream() into soreceive_generic(). - Note that this interface is currently only defined for TLS 1.1 and 1.2, though I believe we will be able to reuse the same interface and structures for 1.3.
# fb401f1b	14-Apr-2020	Jonathan T. Looney <jtl@FreeBSD.org>	Make sonewconn() overflow messages have per-socket rate-limits and values. sonewconn() emits debug-level messages when a listen socket's queue overflows. Currently, sonewconn() tracks overflows on a global basis. It will only log one message every 60 seconds, regardless of how many sockets experience overflows. And, when it next logs at the end of the 60 seconds, it records a single message referencing a single PCB with the total number of overflows across all sockets. This commit changes to per-socket overflow tracking. The code will now log one message every 60 seconds per socket. And, the code will provide per-socket queue length and overflow counts. It also provides a way to change the period between log messages using a sysctl. Reviewed by: jhb (previous version), bcr (manpages) MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24316
# f6ab9795	14-Apr-2020	Jonathan T. Looney <jtl@FreeBSD.org>	Print more detail as part of the sonewconn() overflow message. When a socket's listen queue overflows, sonewconn() emits a debug-level log message. These messages are sometimes useful to systems administrators in highlighting a process which is not keeping up with its listen queue. This commit attempts to enhance the usefulness of this message by printing more details about the socket's address. If all else fails, it will at least print the domain name of the socket. Reviewed by: bz, jhb, kbowling MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24272
# 7029da5c	26-Feb-2020	Pawel Biernacki <kaktus@FreeBSD.org>	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
# f85e1a80	25-Feb-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Make ktls_frame() never fail. Caller must supply correct mbufs. This makes sendfile code a bit simplier.
# 975b8f84	09-Oct-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Cleanup unneeded includes that crept in with r353292.
# 9e14430d	08-Oct-2019	John Baldwin <jhb@FreeBSD.org>	Add a TOE KTLS mode and a TOE hook for allocating TLS sessions. This adds the glue to allocate TLS sessions and invokes it from the TLS enable socket option handler. This also adds some counters for active TOE sessions. The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when TOE KTLS is in use on a socket, but cannot be set via setsockopt(). To simplify various checks, a TLS session now includes an explicit 'mode' member set to the value returned by TLSTX_TLS_MODE. Various places that used to check 'sw_encrypt' against NULL to determine software vs ifnet (NIC) TLS now check 'mode' instead. Reviewed by: np, gallatin Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21891
# b8a6e03f	07-Oct-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Widen NET_EPOCH coverage. When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas. However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win. Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier. On output path we already enter epoch quite early - in the ip_output(), in the ip6_output(). This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch. Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed. This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources. Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111
# b2e60773	26-Aug-2019	John Baldwin <jhb@FreeBSD.org>	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277
# 75697b16	18-Aug-2019	Andrey V. Elsukov <ae@FreeBSD.org>	Use TAILQ_FOREACH_SAFE() macro to avoid use after free in soclose(). PR: 239893 MFC after: 1 week
# a85b7f12	14-Jul-2019	Michael Tuexen <tuexen@FreeBSD.org>	Improve the input validation for l_linger. When using the SOL_SOCKET level socket option SO_LINGER, the structure struct linger is used as the option value. The component l_linger is of type int, but internally copied to the field so_linger of the structure struct socket. The type of so_linger is short, but it is assumed to be non-negative and the value is used to compute ticks to be stored in a variable of type int. Therefore, perform input validation on l_linger similar to the one performed by NetBSD and OpenBSD. Thanks to syzkaller for making me aware of this issue. Thanks to markj@ for pointing out that a similar check should be added to so_linger_set(). Reviewed by: markj@ MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20948
# 6d958292	02-Jul-2019	Mark Johnston <markj@FreeBSD.org>	Fix handling of errors from sblock() in soreceive_stream(). Previously we would attempt to unlock the socket buffer despite having failed to lock it. Simply return an error instead: no resources need to be released at this point, and doing so is consistent with soreceive_generic(). PR: 238789 Submitted by: Greg Becker <greg@codeconcepts.com> MFC after: 1 week
# 82334850	28-Jun-2019	John Baldwin <jhb@FreeBSD.org>	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616
# 1db2626a	27-Jun-2019	John Baldwin <jhb@FreeBSD.org>	Fix comment in sofree() to reference sbdestroy(). r160875 added sbdestroy() as a wrapper around sbrelease_internal to be called from sofree(), yet the comment added in the same revision to sofree() still mentions sbrelease_internal(). Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20488
# 3fe00ac4	03-Mar-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Remove bogus assert that I added in r319722. It is a legitimate case to call soabort() on a newborn socket created by sonewconn() in case if further setup of PCB failed. Code in sofree() handles such socket correctly. Submitted by: jtl, rrs MFC after: 3 weeks
# 7dff7eda	13-Jan-2019	Jason A. Harmening <jah@FreeBSD.org>	Handle SIGIO for listening sockets r319722 separated struct socket and parts of the socket I/O path into listening-socket-specific and dataflow-socket-specific pieces. Listening socket connection notifications are now handled by solisten_wakeup() instead of sowakeup(), but solisten_wakeup() does not currently post SIGIO to the owning process. PR: 234258 Reported by: Kenneth Adelman MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18664
# bcc3cec4	09-Jan-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Simplify sosetopt() so that function has single return point. No functional change.
# 2f2ddd68	04-Jan-2019	Mark Johnston <markj@FreeBSD.org>	Support MSG_DONTWAIT in send(2). As it does for recv(2), MSG_DONTWAIT indicates that the call should not block, returning EAGAIN instead. Linux and OpenBSD both implement this, so the change makes porting easier, especially since we do not return EINVAL or so when unrecognized flags are specified. Submitted by: Greg V <greg@unrelenting.technology> Reviewed by: tuexen MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18728
# 79db6fe7	22-Nov-2018	Mark Johnston <markj@FreeBSD.org>	Plug some networking sysctl leaks. Various network protocol sysctl handlers were not zero-filling their output buffers and thus would export uninitialized stack memory to userland. Fix a number of such handlers. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: tuexen MFC after: 3 days Security: kernel memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18301
# e77f0bdc	18-Oct-2018	Jonathan T. Looney <jtl@FreeBSD.org>	r334853 added a "socket destructor" callback. However, as implemented, it was really a "socket close" callback. Update the socket destructor functionality to run when a socket is destroyed (rather than when it is closed). The original submitter has confirmed that this change satisfies the intended use case. Suggested by: rwatson Submitted by: Michio Honda <micchie at sfc.wide.ad.jp> Tested by: Michio Honda <micchie at sfc.wide.ad.jp> Approved by: re (kib) Differential Revision: https://reviews.freebsd.org/D17590
# ad7eb8ca	03-Oct-2018	Gleb Smirnoff <glebius@FreeBSD.org>	In PR 227259, a user is reporting that they have code which is using shutdown() to wakeup another thread blocked on a stream listen socket. This code is failing, while it used to work on FreeBSD 10 and still works on Linux. It seems reasonable to add another exception to support something users are actually doing, which used to work on FreeBSD 10, and still works on Linux. And, it seems like it should be acceptable to POSIX, as we still return ENOTCONN. This patch is different to what had been committed to stable/11, since code around listening sockets is different. Patch in D15019 is written by jtl@, slightly modified by me. PR: 227259 Obtained from: jtl Approved by: re (kib) Differential Revision: D15019
# 6b01d4d4	21-Aug-2018	Michael Tuexen <tuexen@FreeBSD.org>	Add SOL_SOCKET level socket option with name SO_DOMAIN to get the domain of a socket. This is helpful when testing and Solaris and Linux have the same socket option using the same name. Reviewed by: bcr@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16791
# 3a20f06a	10-Jul-2018	Brooks Davis <brooks@FreeBSD.org>	Use uintptr_t alone when assigning to kvaddr_t variables. Suggested by: jhb
# 7524b4c1	06-Jul-2018	Brooks Davis <brooks@FreeBSD.org>	Correct breakage on 32-bit platforms from r335979.
# f38b68ae	05-Jul-2018	Brooks Davis <brooks@FreeBSD.org>	Make struct xinpcb and friends word-size independent. Replace size_t members with ksize_t (uint64_t) and pointer members (never used as pointers in userspace, but instead as unique idenitifiers) with kvaddr_t (uint64_t). This makes the structs identical between 32-bit and 64-bit ABIs. On 64-bit bit systems, the ABI is maintained. On 32-bit systems, this is an ABI breaking change. The ABI of most of these structs was previously broken in r315662. This also imposes a small API change on userspace consumers who must handle kernel pointers becoming virtual addresses. PR: 228301 (exp-run by antoine) Reviewed by: jtl, kib, rwatson (various versions) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15386
# 0ea9d937	11-Jun-2018	Matt Macy <mmacy@FreeBSD.org>	limit change to fixing controlp handling pending review
# c34bf300	11-Jun-2018	Matt Macy <mmacy@FreeBSD.org>	soreceive_stream: correctly handle edge cases - non NULL controlp is not an error, returning EINVAL would cause X forwarding to fail - MSG_PEEK and MSG_WAITALL are fairly exceptional, but we still want to handle them - punt to soreceive_generic
# 1fbe13cf	08-Jun-2018	Jonathan T. Looney <jtl@FreeBSD.org>	Add a socket destructor callback. This allows kernel providers to set callbacks to perform additional cleanup actions at the time a socket is closed. Michio Honda presented a use for this at BSDCan 2018. (See https://www.bsdcan.org/2018/schedule/events/965.en.html .) Submitted by: Michio Honda <micchie at sfc.wide.ad.jp> (previous version) Reviewed by: lstewart (previous version) Differential Revision: https://reviews.freebsd.org/D15706
# 1a43cff9	06-Jun-2018	Sean Bruno <sbruno@FreeBSD.org>	Load balance sockets with new SO_REUSEPORT_LB option. This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures: Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations: As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). This is a substantially different contribution as compared to its original incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@ for the substantive feedback that is included in this commit. Submitted by: Johannes Lundberg <johalun0@gmail.com> Obtained from: DragonflyBSD Relnotes: Yes Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003
# 7875017c	24-Apr-2018	Sean Bruno <sbruno@FreeBSD.org>	Revert r332894 at the request of the submitter. Submitted by: Johannes Lundberg <johalun0_gmail.com> Sponsored by: Limelight Networks
# 7b7796ee	23-Apr-2018	Sean Bruno <sbruno@FreeBSD.org>	Load balance sockets with new SO_REUSEPORT_LB option This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). Submitted by: Johannes Lundberg <johanlun0@gmail.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003
# 6469bdcd	06-Apr-2018	Brooks Davis <brooks@FreeBSD.org>	Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compabibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# 584ab65a	14-Sep-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Fix locking in soisconnected(). When a newborn socket moves from incomplete queue to complete one, we need to obtain the listening socket lock after the child, which is a wrong order. The old code did that in potentially endless loop of mtx_trylock(). The new one does only one attempt of mtx_trylock(), and in case of failure references listening socket, unlocks child and locks everything in right order. In case if listening socket shuts down during that, just bail out. Reported & tested by: Jason Eggleston <jeggleston llnw.com> Reported & tested by: Jason Wolfe <jason llnw.com>
# 555b3e2f	24-Aug-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Third take on the r319685 and r320480. Actually allow for call soisconnected() via soisdisconnected(), and in the earlier unlock earlier to avoid lock recursion. This fixes a situation when a socket on accept queue is reset before being accepted. Reported by: Jason Eggleston <jeggleston llnw.com>
# 27d8bea8	21-Jul-2017	Michael Tuexen <tuexen@FreeBSD.org>	Fix getsockopt() for listening sockets when using SO_SNDBUF, SO_RCVBUF, SO_SNDLOWAT, SO_RCVLOWAT. Since r31972 it only worked for non-listening sockets. Sponsored by: Netflix, Inc.
# fe715b80	04-Jul-2017	Hans Petter Selasky <hselasky@FreeBSD.org>	After r319722 two fields were left uninitialized when transforming a socket structure into a listening socket. This resulted in an invalid instruction fault for all 32-bit platforms. When INVARIANTS is set the union where the two uninitialized fields reside gets properly zeroed. This patch ensures the two uninitialized fields are zeroed when INVARIANTS is undefined. For 64-bit platforms this issue was not visible because so->sol_upcall which is uninitialized overlaps with so->so_rcv.sb_state which is already zero during soalloc(); For 32-bit platforms this issue was visible and resulted in an invalid instruction fault, because so->sol_upcall overlaps with so->so_rcv.sb_sel which is always initialized to a valid data pointer during soalloc(). Verifying the offset locations mentioned above are identical is left as an exercise to the reader. PR: 220452 PR: 220358 Reviewed by: ae (network), gallatin Differential Revision: https://reviews.freebsd.org/D11475 Sponsored by: Mellanox Technologies
# 64290bef	24-Jun-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Provide sbsetopt() that handles socket buffer related socket options. It distinguishes between data flow sockets and listening sockets, and in case of the latter doesn't change resource limits, since listening sockets don't hold any buffers, they only carry values to be inherited by their children.
# 2b8e036b	15-Jun-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Plug read(2) and write(2) on listening sockets.
# 779f106a	08-Jun-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Listening sockets improvements. o Separate fields of struct socket that belong to listening from fields that belong to normal dataflow, and unionize them. This shrinks the structure a bit. - Take out selinfo's from the socket buffers into the socket. The first reason is to support braindamaged scenario when a socket is added to kevent(2) and then listen(2) is cast on it. The second reason is that there is future plan to make socket buffers pluggable, so that for a dataflow socket a socket buffer can be changed, and in this case we also want to keep same selinfos through the lifetime of a socket. - Remove struct struct so_accf. Since now listening stuff no longer affects struct socket size, just move its fields into listening part of the union. - Provide sol_upcall field and enforce that so_upcall_set() may be called only on a dataflow socket, which has buffers, and for listening sockets provide solisten_upcall_set(). o Remove ACCEPT_LOCK() global. - Add a mutex to socket, to be used instead of socket buffer lock to lock fields of struct socket that don't belong to a socket buffer. - Allow to acquire two socket locks, but the first one must belong to a listening socket. - Make soref()/sorele() to use atomic(9). This allows in some situations to do soref() without owning socket lock. There is place for improvement here, it is possible to make sorele() also to lock optionally. - Most protocols aren't touched by this change, except UNIX local sockets. See below for more information. o Reduce copy-and-paste in kernel modules that accept connections from listening sockets: provide function solisten_dequeue(), and use it in the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4), infiniband, rpc. o UNIX local sockets. - Removal of ACCEPT_LOCK() global uncovered several races in the UNIX local sockets. Most races exist around spawning a new socket, when we are connecting to a local listening socket. To cover them, we need to hold locks on both PCBs when spawning a third one. This means holding them across sonewconn(). This creates a LOR between pcb locks and unp_list_lock. - To fix the new LOR, abandon the global unp_list_lock in favor of global unp_link_lock. Indeed, separating these two locks didn't provide us any extra parralelism in the UNIX sockets. - Now call into uipc_attach() may happen with unp_link_lock hold if, we are accepting, or without unp_link_lock in case if we are just creating a socket. - Another problem in UNIX sockets is that uipc_close() basicly did nothing for a listening socket. The vnode remained opened for connections. This is fixed by removing vnode in uipc_close(). Maybe the right way would be to do it for all sockets (not only listening), simply move the vnode teardown from uipc_detach() to uipc_close()? Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D9770
# b3244df7	06-Jun-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Provide typedef for socket upcall function. While here change so_gen_t type to modern uint64_t.
# b94f68dc	06-Jun-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Remove a piece of dead code.
# 971af2a3	02-Jun-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Rename accept filter getopt/setopt functions, so that they are prefixed with module name and match other functions in the module. There is no functional change.
# 14315212	25-Apr-2017	Patrick Kelsey <pkelsey@FreeBSD.org>	Remove unnecessary check for NULL mbuf in soreceive_generic(). This check has been redundant since it was introduced in r162554. Reviewed by: emaste, glebius MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D10322
# 63649db0	14-Apr-2017	Maxim Sobolev <sobomax@FreeBSD.org>	Restore ability to shutdown DGRAM sockets, still forcing ENOTCONN to be returned by the shutdown(2) system call. This ability has been lost as part of the svn revision 285910. Reviewed by: ed, rwatson, glebius, hiren MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D10351
# 4b481ba0	01-Feb-2017	Hartmut Brandt <harti@FreeBSD.org>	Merge filt_soread and filt_solisten and decide what to do when checking for EVFILT_READ at the point of the check not when the event is registers. This fixes a problem with asio when accepting a connection. Reviewed by: kib@, Scott Mitchell
# f3e7afe2	18-Jan-2017	Hans Petter Selasky <hselasky@FreeBSD.org>	Implement kernel support for hardware rate limited sockets. - Add RATELIMIT kernel configuration keyword which must be set to enable the new functionality. - Add support for hardware driven, Receive Side Scaling, RSS aware, rate limited sendqueues and expose the functionality through the already established SO_MAX_PACING_RATE setsockopt(). The API support rates in the range from 1 to 4Gbytes/s which are suitable for regular TCP and UDP streams. The setsockopt(2) manual page has been updated. - Add rate limit function callback API to "struct ifnet" which supports the following operations: if_snd_tag_alloc(), if_snd_tag_modify(), if_snd_tag_query() and if_snd_tag_free(). - Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT flag, which tells if a network driver supports rate limiting or not. - This patch also adds support for rate limiting through VLAN and LAGG intermediate network devices. - How rate limiting works: 1) The userspace application calls setsockopt() after accepting or making a new connection to set the rate which is then stored in the socket structure in the kernel. Later on when packets are transmitted a check is made in the transmit path for rate changes. A rate change implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the destination network interface, which then sets up a custom sendqueue with the given rate limitation parameter. A "struct m_snd_tag" pointer is returned which serves as a "snd_tag" hint in the m_pkthdr for the subsequently transmitted mbufs. 2) When the network driver sees the "m->m_pkthdr.snd_tag" different from NULL, it will move the packets into a designated rate limited sendqueue given by the snd_tag pointer. It is up to the individual drivers how the rate limited traffic will be rate limited. 3) Route changes are detected by the NIC drivers in the ifp->if_transmit() routine when the ifnet pointer in the incoming snd_tag mismatches the one of the network interface. The network adapter frees the mbuf and returns EAGAIN which causes the ip_output() to release and clear the send tag. Upon next ip_output() a new "snd_tag" will be tried allocated. 4) When the PCB is detached the custom sendqueue will be released by a non-blocking ifp->if_snd_tag_free() call to the currently bound network interface. Reviewed by: wblock (manpages), adrian, gallatin, scottl (network) Differential Revision: https://reviews.freebsd.org/D3687 Sponsored by: Mellanox Technologies MFC after: 3 months
# 339efd75	16-Jan-2017	Maxim Sobolev <sobomax@FreeBSD.org>	Add a new socket option SO_TS_CLOCK to pick from several different clock sources to return timestamps when SO_TIMESTAMP is enabled. Two additional clock sources are: o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME); o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC). In addition to this, this option provides unified interface to get bintime (equivalent of using SO_BINTIME), except it also supported with IPv6 where SO_BINTIME has never been supported. The long term plan is to depreciate SO_BINTIME and move everything to using SO_TS_CLOCK. Idea for this enhancement has been briefly discussed on the Net session during dev summit in Ottawa last June and the general input was positive. This change is believed to benefit network benchmarks/profiling as well as other scenarios where precise time of arrival measurement is necessary. There are two regression test cases as part of this commit: one extends unix domain test code (unix_cmsg) to test new SCM_XXX types and another one implementis totally new test case which exchanges UDP packets between two processes using both conventional methods (i.e. calling clock_gettime(2) before recv(2) and after send(2)), as well as using setsockopt()+recv() in receive path. The resulting delays are checked for sanity for all supported clock types. Reviewed by: adrian, gnn Differential Revision: https://reviews.freebsd.org/D9171
# 7d03ff1f	16-Jan-2017	Hiren Panchasara <hiren@FreeBSD.org>	Add kevent EVFILT_EMPTY for notification when a client has received all data i.e. everything outstanding has been acked. Reviewed by: bz, gnn (previous version) MFC after: 3 days Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D9150
# 14da48cb	06-Jan-2017	John Baldwin <jhb@FreeBSD.org>	Set MORETOCOME for AIO write requests on a socket. Add a MSG_MOREOTOCOME message flag. When this flag is set, sosend* set PRUS_MOREOTOCOME when invoking the protocol send method. The aio worker tasks for sending on a socket set this flag when there are additional write jobs waiting on the socket buffer. Reviewed by: adrian MFC after: 1 month Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D8955
# dd7d4f19	22-Nov-2016	Ruslan Bukin <br@FreeBSD.org>	Revert r306186 ("Adjust the sopt_val pointer on bigendian systems"). This logic doesn't work with bigger sopt_valsize (e.g. when ipfw passing 2048 bytes rule). Reported by: adrian Sponsored by: DARPA, AFRL
# 30f3bfe5	21-Sep-2016	Ruslan Bukin <br@FreeBSD.org>	Adjust the sopt_val pointer on bigendian systems (e.g. MIPS64EB). sooptcopyin() checks if size of data provided by user is <= than we can accept, else it strips down the size. On bigendian platforms we have to move pointer as well so we copy the actual data. Reviewed by: gnn Sponsored by: DARPA, AFRL Sponsored by: HEIF5 Differential Revision: https://reviews.freebsd.org/D7980
# 69a28758	15-Sep-2016	Ed Maste <emaste@FreeBSD.org>	Renumber license clauses in sys/kern to avoid skipping #3
# c3bef61e	15-Sep-2016	Kevin Lo <kevlo@FreeBSD.org>	Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead. Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D7878
# 306e53bc	22-May-2016	Baptiste Daroussin <bapt@FreeBSD.org>	Fix typo introduced by me (not the submitter) when fixing typos
# 2fd642c8	22-May-2016	Baptiste Daroussin <bapt@FreeBSD.org>	Fix typos in the comments Submitted by: cipherwraith666@gmail.com (via github)
# e3043798	29-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	sys/kern: spelling fixes in comments. No functional change.
# 8722384b	29-Apr-2016	John Baldwin <jhb@FreeBSD.org>	Introduce a new protocol hook pru_aio_queue. This allows a protocol to claim individual AIO requests instead of using the default socket AIO handling. Sponsored by: Chelsio Communications
# ab771750	21-Mar-2016	Maxim Konovalov <maxim@FreeBSD.org>	o "avaliable" -> "available". PR: 208141 Submitted by: Tyler Littlefield
# f3215338	01-Mar-2016	John Baldwin <jhb@FreeBSD.org>	Refactor the AIO subsystem to permit file-type-specific handling and improve cancellation robustness. Introduce a new file operation, fo_aio_queue, which is responsible for queueing and completing an asynchronous I/O request for a given file. The AIO subystem now exports library of routines to manipulate AIO requests as well as the ability to run a handler function in the "default" pool of AIO daemons to service a request. A default implementation for file types which do not include an fo_aio_queue method queues requests to the "default" pool invoking the fo_read or fo_write methods as before. The AIO subsystem permits file types to install a private "cancel" routine when a request is queued to permit safe dequeueing and cleanup of cancelled requests. Sockets now use their own pool of AIO daemons and service per-socket requests in FIFO order. Socket requests will not block indefinitely permitting timely cancellation of all requests. Due to the now-tight coupling of the AIO subsystem with file types, the AIO subsystem is now a standard part of all kernels. The VFS_AIO kernel option and aio.ko module are gone. Many file types may block indefinitely in their fo_read or fo_write callbacks resulting in a hung AIO daemon. This can result in hung user processes (when processes attempt to cancel all outstanding requests during exit) or a hung system. To protect against this, AIO requests are only permitted for known "safe" files by default. AIO requests for all file types can be enabled by setting the new vfs.aio.enable_usafe sysctl to a non-zero value. The AIO tests have been updated to skip operations on unsafe file types if the sysctl is zero. Currently, AIO requests on sockets and raw disks are considered safe and are enabled by default. aio_mlock() is also enabled by default. Reviewed by: cem, jilles Discussed with: kib (earlier version) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D5289
# 7325dfbb	01-Feb-2016	Alfred Perlstein <alfred@FreeBSD.org>	Increase max allowed backlog for listen sockets from short to int. PR: 203922 Submitted by: White Knight <white_knight@2ch.net> MFC After: 4 weeks
# b114aa79	27-Jul-2015	Ed Schouten <ed@FreeBSD.org>	Make shutdown() return ENOTCONN as required by POSIX, part deux. Summary: Back in 2005, maxim@ attempted to fix shutdown() to return ENOTCONN in case the socket was not connected (r150152). This had to be rolled back (r150155), as it broke some of the existing programs that depend on this behavior. I reapplied this change on my system and indeed, syslogd failed to start up. I fixed this back in February (279016) and MFC'ed it to the supported stable branches. Apart from that, things seem to work out all right. Since at least Linux and Mac OS X do the right thing, I'd like to go ahead and give this another try. To keep old copies of syslogd working, only start returning ENOTCONN for recent binaries. I took a look at the XNU sources and they seem to test against both SS_ISCONNECTED, SS_ISCONNECTING and SS_ISDISCONNECTING, instead of just SS_ISCONNECTED. That seams reasonable, so let's do the same. Test Plan: This issue was uncovered while writing tests for shutdown() in CloudABI: https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/sys/socket/shutdown_test.c#L26 Reviewers: glebius, rwatson, #manpages, gnn, #network Reviewed By: gnn, #network Subscribers: bms, mjg, imp Differential Revision: https://reviews.freebsd.org/D3039
# 1a7c14ae	24-Jul-2015	Xin LI <delphij@FreeBSD.org>	Fix a typo in comment. Submitted by: Yanhui Shen via twitter MFC after: 3 days
# 0c40f353	13-Jul-2015	Conrad Meyer <cem@FreeBSD.org>	Fix cleanup race between unp_dispose and unp_gc unp_dispose and unp_gc could race to teardown the same mbuf chains, which can lead to dereferencing freed filedesc pointers. This patch adds an IGNORE_RIGHTS flag on unpcbs marking the unpcb's RIGHTS as invalid/freed. The flag is protected by UNP_LIST_LOCK. To serialize against unp_gc, unp_dispose needs the socket object. Change the dom_dispose() KPI to take a socket object instead of an mbuf chain directly. PR: 194264 Differential Revision: https://reviews.freebsd.org/D3044 Reviewed by: mjg (earlier version) Approved by: markj (mentor) Obtained from: mjg MFC after: 1 month Sponsored by: EMC / Isilon Storage Division
# e9b70483	23-Feb-2015	Andrey V. Elsukov <ae@FreeBSD.org>	soreceive_generic() still has similar KASSERT(), therefore instead of remove KASSERT(), change it to check mbuf isn't NULL. Suggested by: kib MFC after: 1 week
# f21684bc	23-Feb-2015	Andrey V. Elsukov <ae@FreeBSD.org>	In some cases soreceive_dgram() can return no data, but has control message. This can happen when application is sending packets too big for the path MTU and recvmsg() will return zero (indicating no data) but there will be a cmsghdr with cmsg_type set to IPV6_PATHMTU. Remove KASSERT() which does NULL pointer dereference in such case. Also call m_freem() only when m isn't NULL. PR: 197882 MFC after: 1 week Sponsored by: Yandex LLC
# a76d4388	14-Feb-2015	Davide Italiano <davide@FreeBSD.org>	Don't access sockbuf fields directly, use accessor functions instead. It is safe to move the call to socantsendmore_locked() after sbdrop_locked() as long as we hold the sockbuf lock across the two calls. CR: D1805 Reviewed by: adrian, kmacy, julian, rwatson
# e834a840	20-Dec-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Revert r274494, r274712, r275955 and provide extra comments explaining why there could appear a zero-sized mbufs in socket buffers. A proper fix would be to divorce record socket buffers and stream socket buffers, and divorce pru_send that accepts normal data from pru_send that accepts control data.
# 5ad25ceb	15-Dec-2014	John Baldwin <jhb@FreeBSD.org>	Check for SS_NBIO in so->so_state instead of sb->sb_flags in soreceive_stream(). Differential Revision: https://reviews.freebsd.org/D1299 Reviewed by: bz, gnn MFC after: 1 week
# 651e4e6a	30-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Merge from projects/sendfile: extend protocols API to support sending not ready data: o Add new flag to pru_send() flags - PRUS_NOTREADY. o Add new protocol method pru_ready(). Sponsored by: Nginx, Inc. Sponsored by: Netflix
# 0f9d0a73	29-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Merge from projects/sendfile: o Introduce a notion of "not ready" mbufs in socket buffers. These mbufs are now being populated by some I/O in background and are referenced outside. This forces following implications: - An mbuf which is "not ready" can't be taken out of the buffer. - An mbuf that is behind a "not ready" in the queue neither. - If sockbet buffer is flushed, then "not ready" mbufs shouln't be freed. o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc. The sb_ccc stands for ""claimed character count", or "committed character count". And the sb_acc is "available character count". Consumers of socket buffer API shouldn't already access them directly, but use sbused() and sbavail() respectively. o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones with M_BLOCKED. o New field sb_fnrdy points to the first not ready mbuf, to avoid linear search. o New function sbready() is provided to activate certain amount of mbufs in a socket buffer. A special note on SCTP: SCTP has its own sockbufs. Unfortunately, FreeBSD stack doesn't yet allow protocol specific sockbufs. Thus, SCTP does some hacks to make itself compatible with FreeBSD: it manages sockbufs on its own, but keeps sb_cc updated to inform the stack of amount of data in them. The new notion of "not ready" data isn't supported by SCTP. Instead, only a mechanical substitute is done: s/sb_cc/sb_ccc/. A proper solution would be to take away struct sockbuf from struct socket and allow protocols to implement their own socket buffers, like SCTP already does. This was discussed with rrs@. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 67af272b	19-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Do not allocate zero-length mbuf in sosend_generic(). Found by: pho Sponsored by: Nginx, Inc.
# 6bf6b25e	14-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Merge from projects/sendfile: Use sbcut_locked() instead of manually editing a sockbuf. Sponsored by: Nginx, Inc.
# cfa6009e	12-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	In preparation of merging projects/sendfile, transform bare access to sb_cc member of struct sockbuf to a couple of inline functions: sbavail() and sbused() Right now they are equal, but once notion of "not ready socket buffer data", will be checked in, they are going to be different. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 71426637	08-Sep-2014	Hiroki Sato <hrs@FreeBSD.org>	- Make hhook_run_socket() vnet-aware instead of adding CURVNET_SET() around the function calls. - Fix a memory leak and stats in the case that hhook_run_socket() fails in soalloc(). PR: 193265
# 9e739a5a	06-Sep-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Fix for r271182. Submitted by: mjg Pointy hat to: me, submitter and everyone who urged me to commit
# d9257d8b	05-Sep-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Set vnet context before accessing V_socket_hhh[]. Submitted by: "Hiroo Ono (小野寛生)" <hiroo.ono+freebsd gmail.com>
# e86447ca	26-Aug-2014	Gleb Smirnoff <glebius@FreeBSD.org>	- Remove socket file operations declaration from sys/file.h. - Make them static in sys_socket.c. - Provide generic invfo_truncate() instead of soo_truncate(). Sponsored by: Netflix Sponsored by: Nginx, Inc.
# ed063112	21-Aug-2014	Hiroki Sato <hrs@FreeBSD.org>	Fix a panic which occurs in a VIMAGE-enabled kernel after r270158, and separate socket_hhook_register() part and put it into VNET_SYS{,UN}INIT() handler. Discussed with: marcel
# 4ec73712	18-Aug-2014	Marcel Moolenaar <marcel@FreeBSD.org>	For vendors like Juniper, extensibility for sockets is important. A good example is socket options that aren't necessarily generic. To this end, OSD is added to the socket structure and hooks are defined for key operations on sockets. These are: o soalloc() and sodealloc() o Get and set socket options o Socket related kevent filters. One aspect about hhook that appears to be not fully baked is the return semantics (the return value from the hook is ignored in hhook_run_hooks() at the time of commit). To support return values, the socket_hhook_data structure contains a 'status' field to hold return values. Submitted by: Anuranjan Shukla <anshukla@juniper.net> Obtained from: Juniper Networks, Inc.
# 4295aa92	03-Aug-2014	Davide Italiano <davide@FreeBSD.org>	Fix an overflow in getsockopt(). optval isn't big enough to hold sbintime_t. Re-introduce r255030 behaviour capping socket timeouts to INT_32 if they're too large. CR: https://phabric.freebsd.org/D433 Reported by: demon Reviewed by: bde [1], jhb [2] MFC after: 2 weeks
# 1e0a021e	26-Jul-2014	Marcel Moolenaar <marcel@FreeBSD.org>	The accept filter code is not specific to the FreeBSD IPv4 network stack, so it really should not be under "optional inet". The fact that uipc_accf.c lives under kern/ lends some weight to making it a "standard" file. Moving kern/uipc_accf.c from "optional inet" to "standard" eliminates the need for #ifdef INET in kern/uipc_socket.c. Also, this meant the net.inet.accf.unloadable sysctl needed to move, as net.inet does not exist without networking compiled in (as it lives in netinet/in_proto.c.) The new sysctl has been named net.accf.unloadable. In order to support existing accept filter sysctls, the net.inet.accf node has been added netinet/in_proto.c. Submitted by: Steve Kiernan <stevek@juniper.net> Obtained from: Juniper Networks, Inc.
# d978bbea	16-Jan-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Simplify wait/nowait code, eventually killing last remnant of historical mbuf(9) allocator flag. Sponsored by: Nginx, Inc.
# 16ef0fa8	08-Nov-2013	Hiren Panchasara <hiren@FreeBSD.org>	Fix typo in a comment.
# f7a3a2a5	31-Oct-2013	Maksim Yevmenkin <emax@FreeBSD.org>	Rate limit (to once per minute) "Listen queue overflow" message in sonewconn(). Reviewed by: scottl, lstewart Obtained from: Netflix, Inc MFC after: 2 weeks
# 3846a822	16-Sep-2013	Konstantin Belousov <kib@FreeBSD.org>	Remove zero-copy sockets code. It only worked for anonymous memory, and the equivalent functionality is now provided by sendfile(2) over posix shared memory filedescriptor. Remove the cow member of struct vm_page, and rearrange the remaining members. While there, make hold_count unsigned. Requested and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (delphij)
# 7729cbf1	01-Sep-2013	Davide Italiano <davide@FreeBSD.org>	Fix socket buffer timeouts precision using the new sbintime_t KPI instead of relying on the tvtohz() workaround. The latter has been introduced lately by jhb@ (r254699) in order to have a fix that can be backported to STABLE. Reported by: Vitja Makarov <vitja.makarov at gmail dot com> Reviewed by: jhb (earlier version)
# e289e9f2	29-Aug-2013	John Baldwin <jhb@FreeBSD.org>	Don't return an error for socket timeouts that are too large. Just cap them to INT_MAX ticks instead. PR: kern/181416 (r254699 really) Requested by: bde MFC after: 2 weeks
# e77c507d	23-Aug-2013	John Baldwin <jhb@FreeBSD.org>	Use tvtohz() to convert a socket buffer timeout to a tick value rather than using a home-rolled version. The home-rolled version could result in shorter-than-requested sleeps. Reported by: Vitja Makarov <vitja.makarov@gmail.com> MFC after: 2 weeks
# 6753da13	08-May-2013	Andre Oppermann <andre@FreeBSD.org>	When the accept queue is full print the number of already pending new connections instead of by how many we're over the limit, which is always 1. Noticed by: jmallet MFC after: 1 week
# f89d4c3a	06-May-2013	Andre Oppermann <andre@FreeBSD.org>	Back out r249318, r249320 and r249327 due to a heisenbug most likely related to a race condition in the ipi_hash_lock with the exact cause currently unknown but under investigation.
# cd31b6dd	30-Apr-2013	Jilles Tjoelker <jilles@FreeBSD.org>	socket: Make shutdown() wake up a blocked accept(). A blocking accept (and some other operations) waits on &so->so_timeo. Once it wakes up, it will detect the SBS_CANTRCVMORE bit. The error from accept() is [ECONNABORTED] which is not the nicest one -- the thread calling accept() needs to know out-of-band what is happening. A spurious wakeup on so->so_timeo appears harmless (sleep retried) except when lingering on close (SO_LINGER, and in that case there is no descriptor to call shutdown() on) so this should be fairly safe. A shutdown() already woke up a blocked accept() for TCP sockets, but not for Unix domain sockets. This fix is generic for all domains. This patch was sent to -hackers@ and -net@ on April 5. MFC after: 2 weeks
# d58a9653	09-Apr-2013	Jim Harris <jimharris@FreeBSD.org>	Fix the build.
# e8b3186b	09-Apr-2013	Andre Oppermann <andre@FreeBSD.org>	Change certain heavily used network related mutexes and rwlocks to reside on their own cache line to prevent false sharing with other nearby structures, especially for those in the .bss segment. NB: Those mutexes and rwlocks with variables next to them that get changed on every invocation do not benefit from their own cache line. Actually it may be net negative because two cache misses would be incurred in those cases.
# a307eb26	29-Mar-2013	Gleb Smirnoff <glebius@FreeBSD.org>	When soreceive_generic() hands off an mbuf from buffer, clear its pointer to next record, since next record belongs to the buffer, and shouldn't be leaked. The ng_ksocket(4) used to clear this pointer itself, but the correct place is here. Sponsored by: Nginx, Inc
# c2e3c52e	19-Mar-2013	Jilles Tjoelker <jilles@FreeBSD.org>	Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC. This change allows creating file descriptors with close-on-exec set in some situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file descriptors (SCM_RIGHTS) atomically close-on-exec. The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD. MSG_CMSG_CLOEXEC is the first free bit for MSG_. The SOCK_ flags are not passed to MAC because this may cause incorrect failures and can be done later via fcntl() anyway. On the other hand, audit is expected to cope with the new flags. For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags argument. Reviewed by: kib
# fbb34710	11-Mar-2013	Michael Tuexen <tuexen@FreeBSD.org>	Return an error if sctp_peeloff() fails because a socket can't be allocated. MFC after: 3 days
# 7493f24e	02-Mar-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	- Implement two new system calls: int bindat(int fd, int s, const struct sockaddr addr, socklen_t addrlen); int connectat(int fd, int s, const struct sockaddr name, socklen_t namelen); which allow to bind and connect respectively to a UNIX domain socket with a path relative to the directory associated with the given file descriptor 'fd'. - Add manual pages for the new syscalls. - Make the new syscalls available for processes in capability mode sandbox. - Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on the directory descriptor for the syscalls to work. - Update audit(4) to support those two new syscalls and to handle path in sockaddr_un structure relative to the given directory descriptor. - Update procstat(1) to recognize the new capability rights. - Document the new capability rights in cap_rights_limit(2). Sponsored by: The FreeBSD Foundation Discussed with: rwatson, jilles, kib, des
# 6e0b6746	07-Dec-2012	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Configure UMA warnings for the following zones: - unp_zone: kern.ipc.maxsockets limit reached - socket_zone: kern.ipc.maxsockets limit reached - zone_mbuf: kern.ipc.nmbufs limit reached - zone_clust: kern.ipc.nmbclusters limit reached - zone_jumbop: kern.ipc.nmbjumbop limit reached - zone_jumbo9: kern.ipc.nmbjumbo9 limit reached - zone_jumbo16: kern.ipc.nmbjumbo16 limit reached Note that those warnings are printed not often than every five minutes and can be globally turned off by setting sysctl/tunable vm.zone_warnings to 0. Discussed on: arch Obtained from: WHEEL Systems MFC after: 2 weeks
# 94b0ae5d	07-Dec-2012	Pawel Jakub Dawidek <pjd@FreeBSD.org>	- Make socket_zone static - it is used only in this file. - Update maxsockets on uma_zone_set_max(). Obtained from: WHEEL Systems
# 68412f41	07-Dec-2012	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Style cleanups.
# b08d12d9	06-Dec-2012	Kevin Lo <kevlo@FreeBSD.org>	- according to POSIX, make socket(2) return EAFNOSUPPORT rather than EPROTONOSUPPORT if the address family is not supported. - introduce pffinddomain() to find a domain by family and use it as appropriate. Reviewed by: glebius
# eb1b1807	05-Dec-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Mechanically substitute flags from historic mbuf allocator with malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually
# 358c7f47	27-Nov-2012	Andre Oppermann <andre@FreeBSD.org>	Fix r243627 by testing against the head socket instead of the socket just created. MFC after: 1 week X-MFC-with: r243627
# ead46972	27-Nov-2012	Andre Oppermann <andre@FreeBSD.org>	Base the mbuf related limits on the available physical memory or kernel memory, whichever is lower. The overall mbuf related memory limit must be set so that mbufs (and clusters of various sizes) can't exhaust physical RAM or KVM. The limit is set to half of the physical RAM or KVM (whichever is lower) as the baseline. In any normal scenario we want to leave at least half of the physmem/kvm for other kernel functions and userspace to prevent it from swapping too easily. Via a tunable kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm. At the same time divorce maxfiles from maxusers and set maxfiles to physpages / 8 with a floor based on maxusers. This way busy servers can make use of the significantly increased mbuf limits with a much larger number of open sockets. Tidy up ordering in init_param2() and check up on some users of those values calculated here. Out of the overall mbuf memory limit 2K clusters and 4K (page size) clusters to get 1/4 each because these are the most heavily used mbuf sizes. 2K clusters are used for MTU 1500 ethernet inbound packets. 4K clusters are used whenever possible for sends on sockets and thus outbound packets. The larger cluster sizes of 9K and 16K are limited to 1/6 of the overall mbuf memory limit. When jumbo MTU's are used these large clusters will end up only on the inbound path. They are not used on outbound, there it's still 4K. Yes, that will stay that way because otherwise we run into lots of complications in the stack. And it really isn't a problem, so don't make a scene. Normal mbufs (256B) weren't limited at all previously. This was problematic as there are certain places in the kernel that on allocation failure of clusters try to piece together their packet from smaller mbufs. The mbuf limit is the number of all other mbuf sizes together plus some more to allow for standalone mbufs (ACK for example) and to send off a copy of a cluster. Unfortunately there isn't a way to set an overall limit for all mbuf memory together as UMA doesn't support such a limiting. NB: Every cluster also has an mbuf associated with it. Two examples on the revised mbuf sizing limits: 1GB KVM: 512MB limit for mbufs 419,430 mbufs 65,536 2K mbuf clusters 32,768 4K mbuf clusters 9,709 9K mbuf clusters 5,461 16K mbuf clusters 16GB RAM: 8GB limit for mbufs 33,554,432 mbufs 1,048,576 2K mbuf clusters 524,288 4K mbuf clusters 155,344 9K mbuf clusters 87,381 16K mbuf clusters These defaults should be sufficient for even the most demanding network loads. MFC after: 1 month
# 2c3142c8	27-Nov-2012	Andre Oppermann <andre@FreeBSD.org>	Fix a race on listen socket teardown where while draining the accept queues a new socket/connection may be added to the queue due to a race on the ACCEPT_LOCK. The submitted patch is slightly changed in comments, teardown and locking order and extended with KASSERT's. Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com> Found by: His team. MFC after: 1 week
# e8ad36ab	28-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	In soreceive_stream() don't drop an already dequeued mbuf chain by overwriting the return mbuf pointer with newly received data after a loop. Instead append the new mbuf chain to the existing one. Fix up sb_lastrecord when dequeuing mbuf's so that sbappend_stream() doesn't get confused. For the remainder copy case in the mbuf delivery part deduct the copied length len instead of the whole mbuf length. Additionally don't depend on 'n' being being available which isn't true in the case of MSG_PEEK. Fix the MSG_WAITALL case by comparing against sb_hiwat. Before it was looping for every receive as sb_lowat normally is zero. Add comment about issue with (MSG_WAITALL \| MSG_PEEK) which isn't properly handled. Submitted by: trociny (except for the change in last paragraph)
# fdd1b7f5	28-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Add logging for socket attach failures in sonewconn() during accept(2). Include the pointer to the PCB so it can be attributed to a particular application by corresponding it to "netstat -A" output. MFC after: 2 weeks
# e37e60c3	23-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Replace the ill-named ZERO_COPY_SOCKET kernel option with two more appropriate named kernel options for the very distinct send and receive path. "options SOCKET_SEND_COW" enables VM page copy-on-write based sending of data on an outbound socket. NB: The COW based send mechanism is not safe and may result in kernel crashes. "options SOCKET_RECV_PFLIP" enables VM kernel/userspace page flipping for special disposable pages attached as external storage to mbufs. Only the naming of the kernel options is changed and their corresponding #ifdef sections are adjusted. No functionality is added or removed. Discussed with: alc (mechanism and limitations of send side COW)
# dc00208e	20-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Grammar fixes to r241781. Submitted by: alc
# 2bdf61ca	19-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Hide the unfortunate named sysctl kern.ipc.somaxconn from sysctl -a output and replace it with a new visible sysctl kern.ipc.acceptqueue of the same functionality. It specifies the maximum length of the accept queue on a listen socket. The old kern.ipc.somaxconn remains available for reading and writing for compatibility reasons so that existing programs, scripts and configurations continue to work. There no plans to ever remove the orginal and now hidden kern.ipc.somaxconn.
# 1490de00	20-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Tidy up somaxconn (accept queue limit) and related functions and move it together into one place.
# 4b62fe5b	18-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Move socket UMA zone initialization functionality together into one place.
# cf8e6069	19-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Move UMA socket zone initialization from uipc_domain.c to uipc_socket.c into one place next to its other related functions to avoid confusion.
# d10733a8	18-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Remove unnecessary includes from sosend_copyin() and fix a couple of style issues.
# 1d147759	18-Oct-2012	Andre Oppermann <andre@FreeBSD.org>	Remove double-wrapping of #ifdef ZERO_COPY_SOCKETS within zero copy specialized sosend_copyin() helper function.
# 48b5c741	02-Oct-2012	Garrett Wollman <wollman@FreeBSD.org>	Fix spelling of the function name in two assertion messages.
# bb9f214f	02-Sep-2012	Mikolaj Golub <trociny@FreeBSD.org>	In soreceive_generic() remove the optimization for the case when MSG_WAITALL is set, and it is possible to do the entire receive operation at once if we block (resid <= hiwat). Actually it might make the recv(2) with MSG_WAITALL flag get stuck when there is enough space in the receiver buffer to satisfy the request but not enough to open the window closed previously due to the buffer being full. The issue can be reproduced using the following scenario: On the sender side do 2 send(2) requests: 1) data of size much smaller than SOBUF_SIZE (e.g. SOBUF_SIZE / 10); 2) data of size equal to SOBUF_SIZE. On the receiver side do 2 recv(2) requests with MSG_WAITALL flag set: 1) recv() data of SOBUF_SIZE / 10 size; 2) recv() data of SOBUF_SIZE size; We totally fill the receiver buffer with one SOBUF_SIZE/10 size request and partial SOBUF_SIZE request. When the first request is processed we get SOBUF_SIZE/10 free space. It is just enough to receive the rest of bytes for the second request, and soreceive_generic() blocks in the part that is a subject of this change waiting for the rest. But the window was closed when the buffer was filled and to avoid silly window syndrome it opens only when available space is larger than sb_hiwat/4 or maxseg. So it is stuck and pending data is only sent via TCP window probes. Discussed with: kib (long ago) MFC after: 2 weeks
# 2ad099fc	02-Sep-2012	Mikolaj Golub <trociny@FreeBSD.org>	In soreceive_generic() when checking if the type of mbuf has changed check it for MT_CONTROL type too, otherwise the assertion "m->m_type == MT_DATA" below may be triggered by the following scenario: - the sender sends some data (MT_DATA) and then a file descriptor (MT_CONTROL); - the receiver calls recv(2) with a MSG_WAITALL asking for data larger than the receive buffer (uio_resid > hiwat). MFC after: 2 week
# e71a7957	03-Jul-2012	Mikolaj Golub <trociny@FreeBSD.org>	Fix KASSERT message. MFC after: 3 days
# 60a30588	03-Apr-2012	Navdeep Parhar <np@FreeBSD.org>	- Remove redundant call to pr_ctloutput from code that handles SO_SETFIB. - Add a check for errors during copyin while here. Reviewed by: julian, bz MFC after: 2 weeks
# 747d2fa1	26-Feb-2012	Konstantin Belousov <kib@FreeBSD.org>	Add SO_PROTOCOL/SO_PROTOTYPE socket SOL_SOCKET-level option to get the socket protocol number. This is useful since the socket type can be implemented by different protocols in the same protocol family, e.g. SOCK_STREAM may be provided by both TCP and SCTP. Submitted by: Jukka A. Ukkonen <jau iki fi> PR: kern/162352 Discussed with: bz Reviewed by: glebius MFC after: 2 weeks
# 9493639e	26-Feb-2012	Konstantin Belousov <kib@FreeBSD.org>	Remove apparently redundand checks for socket so_proto being non-NULL from sosetopt() and sogetopt(). No exposed sockets may have so_proto invalid. Discussed with: bz, rwatson Reviewed by: glebius MFC after: 2 weeks
# 526d0bd5	20-Feb-2012	Konstantin Belousov <kib@FreeBSD.org>	Fix found places where uio_resid is truncated to int. Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from the usermode. Discussed with: bde, das (previous versions) MFC after: 1 month
# cf8b8325	04-Feb-2012	Hiroki Sato <hrs@FreeBSD.org>	Fix input validation in SO_SETFIB. Reviewed by: bz MFC after: 1 day
# ee799639	03-Feb-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	Add SO_SETFIB option support on PF_INET6 sockets and allow inheriting the FIB number from the process, as set by setfib(2), on socket creation. Sponsored by: Cisco Systems, Inc.
# ea4d9a14	14-Nov-2011	Robert Millan <rmh@FreeBSD.org>	Remove a few bits of FreeBSD 2.x compatibility code. Approved by: kib (mentor)
# 6aba400a	25-Aug-2011	Attilio Rao <attilio@FreeBSD.org>	Fix a deficiency in the selinfo interface: If a selinfo object is recorded (via selrecord()) and then it is quickly destroyed, with the waiters missing the opportunity to awake, at the next iteration they will find the selinfo object destroyed, causing a PF#. That happens because the selinfo interface has no way to drain the waiters before to destroy the registered selinfo object. Also this race is quite rare to get in practice, because it would require a selrecord(), a poll request by another thread and a quick destruction of the selrecord()'ed selinfo object. Fix this by adding the seldrain() routine which should be called before to destroy the selinfo objects (in order to avoid such case), and fix the present cases where it might have already been called. Sometimes, the context is safe enough to prevent this type of race, like it happens in device drivers which installs selinfo objects on poll callbacks. There, the destruction of the selinfo object happens at driver detach time, when all the filedescriptors should be already closed, thus there cannot be a race. For this case, mfi(4) device driver can be set as an example, as it implements a full correct logic for preventing this from happening. Sponsored by: Sandvine Incorporated Reported by: rstone Tested by: pluknet Reviewed by: jhb, kib Approved by: re (bz) MFC after: 3 weeks
# 695da99e	08-Jul-2011	Andre Oppermann <andre@FreeBSD.org>	In the experimental soreceive_stream(): o Move the non-blocking socket test below the SBS_CANTRCVMORE so that EOF is correctly returned on a remote connection close. o In the non-blocking socket test compare SS_NBIO against the so->so_state field instead of the incorrect sb->sb_state field. o Simplify the ENOTCONN test by removing cases that can't occur. Submitted by: trociny (with some further tweaks by committer) Tested by: trociny
# 1c6e7fa7	07-Jul-2011	Andre Oppermann <andre@FreeBSD.org>	Remove the TCP_SORECEIVE_STREAM compile time option. The use of soreceive_stream() for TCP still has to be enabled with the loader tuneable net.inet.tcp.soreceive_stream. Suggested by: trociny and others
# 3204c8e5	29-May-2011	Mikolaj Golub <trociny@FreeBSD.org>	In soreceive_generic(), if MSG_WAITALL is set but the request is larger than the receive buffer, we have to receive in sections. When notifying the protocol that some data has been drained the lock is released for a moment. Returning we block waiting for the rest of data. There is a race, when data could arrive while the lock was released and then the connection stalls in sbwait. Fix this by checking for data before blocking and skip blocking if there are some. PR: kern/154504 Reported by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua> Tested by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua> Reviewed by: rwatson Approved by: kib (co-mentor) MFC after: 2 weeks
# 1fb51a12	16-Feb-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Mfp4 CH=177274,177280,177284-177285,177297,177324-177325 VNET socket push back: try to minimize the number of places where we have to switch vnets and narrow down the time we stay switched. Add assertions to the socket code to catch possibly unset vnets as seen in r204147. While this reduces the number of vnet recursion in some places like NFS, POSIX local sockets and some netgraph, .. recursions are impossible to fix. The current expectations are documented at the beginning of uipc_socket.c along with the other information there. Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH Reviewed by: jhb Tested by: zec Tested by: Mikolaj Golub (to.my.trociny gmail.com) MFC after: 2 weeks
# f7e6ce6d	12-Feb-2011	Daniel Eischen <deischen@FreeBSD.org>	Allow the SO_SETFIB socket option to select the default (0) routing table. Reviewed by: julian
# 0028e524	11-Feb-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Mfp4 CH=177255: Make VNET_ASSERT() available with either VNET_DEBUG or INVARIANTS. Change the syntax to match KASSERT() to allow more flexible panic messages rather than having a printf with hardcoded arguments before panic. Adjust the few assertions we have to the new format (and enhance the output). Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH Reviewed by: jhb MFC after: 2 weeks
# 5c9d0a9a	12-Nov-2010	Luigi Rizzo <luigi@FreeBSD.org>	This commit implements the SO_USER_COOKIE socket option, which lets you tag a socket with an uint32_t value. The cookie can then be used by the kernel for various purposes, e.g. setting the skipto rule or pipe number in ipfw (this is the reason SO_USER_COOKIE has been implemented; however there is nothing ipfw-specific in its implementation). The ipfw-related code that uses the optopn will be committed separately. This change adds a field to 'struct socket', but the struct is not part of any driver or userland-visible ABI so the change should be harmless. See the discussion at http://lists.freebsd.org/pipermail/freebsd-ipfw/2009-October/004001.html Idea and code from Paul Joe, small modifications and manpage changes by myself. Submitted by: Paul Joe MFC after: 1 week
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# adb6aa9a	18-Sep-2010	Robert Watson <rwatson@FreeBSD.org>	With reworking of the socket life cycle in 7.x, the need for a "sotryfree()" was eliminated: all references to sockets are explicitly managed by sorele() and the protocols. As such, garbage collect sotryfree(), and update sofree() comments to make the new world order more clear. MFC after: 3 days Reported by: Anuranjan Shukla <anshukla at juniper dot net>
# af9ba7d8	07-Aug-2010	Michael Tuexen <tuexen@FreeBSD.org>	Fix a bug where MSG_TRUNC was not returned in all necessary cases for SOCK_DGRAM socket. MSG_TRUNC was only returned when some mbufs could not be copied to the application. If some data was left in the last mbuf, it was correctly discarded, but MSG_TRUNC was not set. Reviewed by: bz MFC after: 3 weeks
# 9b47fa59	01-Jun-2010	Robert Watson <rwatson@FreeBSD.org>	Merge r208601 from head to stable/8: When close() is called on a connected socket pair, SO_ISCONNECTED might be set but be cleared before the call to sodisconnect(). In this case, ENOTCONN is returned: suppress this error rather than returning it to userspace so that close() doesn't report an error improperly. PR: kern/144061 Reported by: Matt Reimer <mreimer at vpop.net>, Nikolay Denev <ndenev at gmail.com>, Mikolaj Golub <to.my.trociny at gmail.com> Approved by: re (kib)
# e35973e4	27-May-2010	Robert Watson <rwatson@FreeBSD.org>	When close() is called on a connected socket pair, SO_ISCONNECTED might be set but be cleared before the call to sodisconnect(). In this case, ENOTCONN is returned: suppress this error rather than returning it to userspace so that close() doesn't report an error improperly. PR: kern/144061 Reported by: Matt Reimer <mreimer at vpop.net>, Nikolay Denev <ndenev at gmail.com>, Mikolaj Golub <to.my.trociny at gmail.com> MFC after: 3 days
# 4ccf64eb	06-Apr-2010	Nathan Whitehorn <nwhitehorn@FreeBSD.org>	MFC r205014,205015: Provide groundwork for 32-bit binary compatibility on non-x86 platforms, for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32 option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts of the kernel and enhances the freebsd32 compatibility code to support big-endian platforms. This MFC is required for MFCs of later changes to the freebsd32 compatibility from HEAD. Requested by: kib
# 67208dfa	27-Mar-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	MFC r204147: Set curvnet earlier so that it also covers calls to sodisconnect(), which before were possibly panicing the system in ULP code in the VIMAGE case. Submitted by: Igor (igor ispsystem.com)
# 841c0c7e	11-Mar-2010	Nathan Whitehorn <nwhitehorn@FreeBSD.org>	Provide groundwork for 32-bit binary compatibility on non-x86 platforms, for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32 option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts of the kernel and enhances the freebsd32 compatibility code to support big-endian platforms. Reviewed by: kib, jhb
# 0a68a459	20-Feb-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	Set curvnet earlier so that it also covers calls to sodisconnect(), which before were possibly panicing the system in ULP code in the VIMAGE case. Submitted by: Igor (igor ispsystem.com) MFC after: 5 days
# 70785093	14-Dec-2009	Robert Watson <rwatson@FreeBSD.org>	Merge r197720 from head to stable/8: Don't comment on stream socket handling in sosend_dgram, since that's not handled.
# afd8e45b	02-Oct-2009	Robert Watson <rwatson@FreeBSD.org>	Don't comment on stream socket handling in sosend_dgram, since that's not handled. MFC after: 3 weeks
# 11c99a6d	15-Sep-2009	Andre Oppermann <andre@FreeBSD.org>	-Put the optimized soreceive_stream() under a compile time option called TCP_SORECEIVE_STREAM for the time being. Requested by: brooks Once compiled in make it easily switchable for testers by using a tuneable net.inet.tcp.soreceive_stream and a corresponding read-only sysctl to report the current state. Suggested by: rwatson MFC after: 2 days
# e76d823b	12-Sep-2009	Robert Watson <rwatson@FreeBSD.org>	Use C99 initialization for struct filterops. Obtained from: Mac OS X Sponsored by: Apple Inc. MFC after: 3 weeks
# 74d1c492	25-Aug-2009	Jilles Tjoelker <jilles@FreeBSD.org>	Fix poll() on half-closed sockets, while retaining POLLHUP for fifos. This reverts part of r196460, so that sockets only return POLLHUP if both directions are closed/error. Fifos get POLLHUP by closing the unused direction immediately after creating the sockets. The tools/regression/poll/*poll.c tests now pass except for two other things: - if POLLHUP is returned, POLLIN is always returned as well instead of only when there is data left in the buffer to be read - fifo old/new reader distinction does not work the way POSIX specs it Reviewed by: kib, bde
# f2159cc7	22-Aug-2009	Konstantin Belousov <kib@FreeBSD.org>	Fix the conformance of poll(2) for sockets after r195423 by returning POLLHUP instead of POLLIN for several cases. Now, the tools/regression/poll results for FreeBSD are closer to that of the Solaris and Linux. Also, improve the POSIX conformance by explicitely clearing POLLOUT when POLLHUP is reported in pollscan(), making the fix global. Submitted by: bde Reviewed by: rwatson MFC after: 1 week
# 530c0060	01-Aug-2009	Robert Watson <rwatson@FreeBSD.org>	Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)
# 7973fba3	28-Jul-2009	Julian Elischer <julian@FreeBSD.org>	Somewhere along the line accept sockets stopped honoring the FIB selected for them. Fix this. Reviewed by: ambrisko Approved by: re (kib) MFC after: 3 days
# 006e9db4	19-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Normalize field naming for struct vnet, fix two debugging printfs that print them. Reviewed by: bz Approved by: re (kensmith, kib)
# 7f5dff50	07-Jul-2009	Konstantin Belousov <kib@FreeBSD.org>	Fix poll(2) and select(2) for named pipes to return "ready for read" when all writers, observed by reader, exited. Use writer generation counter for fifo, and store the snapshot of the fifo generation in the f_seqcount field of struct file, that is otherwise unused for fifos. Set FreeBSD-undocumented POLLINIGNEOF flag only when file f_seqcount is equal to fifo' fi_wgen, and revert r89376. Fix POLLINIGNEOF for sockets and pipes, and return POLLHUP for them. Note that the patch does not fix not returning POLLHUP for fifos. PR: kern/94772 Submitted by: bde (original version) Reviewed by: rwatson, jilles Approved by: re (kensmith) MFC after: 6 weeks (might be)
# ef760e6a	22-Jun-2009	Andre Oppermann <andre@FreeBSD.org>	Add soreceive_stream(), an optimized version of soreceive() for stream (TCP) sockets. It is functionally identical to generic soreceive() but has a number stream specific optimizations: o does only one sockbuf unlock/lock per receive independent of the length of data to be moved into the uio compared to soreceive() which unlocks/locks per mbuf. o uses m_mbuftouio() instead of its own copy(out) variant. o much more compact code flow as a large number of special cases is removed. o much improved reability. It offers significantly reduced CPU usage and lock contention when receiving fast TCP streams. Additional gains are obtained when the receiving application is using SO_RCVLOWAT to batch up some data before a read (and wakeup) is done. This function was written by "reverse engineering" and is not just a stripped down variant of soreceive(). It is not yet enabled by default on TCP sockets. Instead it is commented out in the protocol initialization in tcp_usrreq.c until more widespread testing has been done. Testers, especially with 10GigE gear, are welcome. MFP4: r164817 //depot/user/andre/soreceive_stream/
# 9ed47d01	15-Jun-2009	Jamie Gritton <jamie@FreeBSD.org>	Get vnets from creds instead of threads where they're available, and from passed threads instead of curthread. Reviewed by: zec, julian Approved by: bz (mentor)
# d8b0556c	10-Jun-2009	Konstantin Belousov <kib@FreeBSD.org>	Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use vnode interlock to protect the knote fields [1]. The locking assumes that shared vnode lock is held, thus we get exclusive access to knote either by exclusive vnode lock protection, or by shared vnode lock + vnode interlock. Do not use kl_locked() method to assert either lock ownership or the fact that curthread does not own the lock. For shared locks, ownership is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared lock not owned by curthread, causing false positives in kqueue subsystem assertions about knlist lock. Remove kl_locked method from knlist lock vector, and add two separate assertion methods kl_assert_locked and kl_assert_unlocked, that are supposed to use proper asserts. Change knlist_init accordingly. Add convenience function knlist_init_mtx to reduce number of arguments for typical knlist initialization. Submitted by: jhb [1] Noted by: jhb [2] Reviewed by: jhb Tested by: rnoland
# bcf11e8d	05-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
# f93bfb23	02-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Add internal 'mac_policy_count' counter to the MAC Framework, which is a count of the number of registered policies. Rather than unconditionally locking sockets before passing them into MAC, lock them in the MAC entry points only if mac_policy_count is non-zero. This avoids locking overhead for a number of socket system calls when no policies are registered, eliminating measurable overhead for the MAC Framework for the socket subsystem when there are no active policies. Possibly socket locks should be acquired by policies if they are required for socket labels, which would further avoid locking overhead when there are policies but they don't require labeling of sockets, or possibly don't even implement socket controls. Obtained from: TrustedBSD Project
# 74fb0ba7	01-Jun-2009	John Baldwin <jhb@FreeBSD.org>	Rework socket upcalls to close some races with setup/teardown of upcalls. - Each socket upcall is now invoked with the appropriate socket buffer locked. It is not permissible to call soisconnected() with this lock held; however, so socket upcalls now return an integer value. The two possible values are SU_OK and SU_ISCONNECTED. If an upcall returns SU_ISCONNECTED, then the soisconnected() will be invoked on the socket after the socket buffer lock is dropped. - A new API is provided for setting and clearing socket upcalls. The API consists of soupcall_set() and soupcall_clear(). - To simplify locking, each socket buffer now has a separate upcall. - When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from the receive socket buffer automatically. Note that a SO_SND upcall should never return SU_ISCONNECTED. - All this means that accept filters should now return SU_ISCONNECTED instead of calling soisconnected() directly. They also no longer need to explicitly clear the upcall on the new socket. - The HTTP accept filter still uses soupcall_set() to manage its internal state machine, but other accept filters no longer have any explicit knowlege of socket upcall internals aside from their return value. - The various RPC client upcalls currently drop the socket buffer lock while invoking soreceive() as a temporary band-aid. The plan for the future is to add a new flag to allow soreceive() to be called with the socket buffer locked. - The AIO callback for socket I/O is now also invoked with the socket buffer locked. Previously sowakeup() would drop the socket buffer lock only to call aio_swake() which immediately re-acquired the socket buffer lock for the duration of the function call. Discussed with: rwatson, rmacklem
# 2114e063	08-May-2009	Marko Zec <zec@FreeBSD.org>	A NOP change: style / whitespace cleanup of the noise that slipped into r191816. Spotted by: bz Approved by: julian (mentor) (an earlier version of the diff)
# 21ca7b57	05-May-2009	Marko Zec <zec@FreeBSD.org>	Change the curvnet variable from a global const struct vnet , previously always pointing to the default vnet context, to a dynamically changing thread-local one. The currvnet context should be set on entry to networking code via CURVNET_SET() macros, and reverted to previous state via CURVNET_RESTORE(). Recursions on curvnet are permitted, though strongly discuouraged. This change should have no functional impact on nooptions VIMAGE kernel builds, where CURVNET_ macros expand to whitespace. The curthread->td_vnet (aka curvnet) variable's purpose is to be an indicator of the vnet context in which the current network-related operation takes place, in case we cannot deduce the current vnet context from any other source, such as by looking at mbuf's m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so far curvnet has turned out to be an invaluable consistency checking aid: it helps to catch cases when sockets, ifnets or any other vnet-aware structures may have leaked from one vnet to another. The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros was a result of an empirical iterative process, whith an aim to reduce recursions on CURVNET_SET() to a minimum, while still reducing the scope of CURVNET_SET() to networking only operations - the alternative would be calling CURVNET_SET() on each system call entry. In general, curvnet has to be set in three typicall cases: when processing socket-related requests from userspace or from within the kernel; when processing inbound traffic flowing from device drivers to upper layers of the networking stack, and when executing timer-driven networking functions. This change also introduces a DDB subcommand to show the list of all vnet instances. Approved by: julian (mentor)
# f6dfe47a	30-Apr-2009	Marko Zec <zec@FreeBSD.org>	Permit buiding kernels with options VIMAGE, restricted to only a single active network stack instance. Turning on options VIMAGE at compile time yields the following changes relative to default kernel build: 1) V_ accessor macros for virtualized variables resolve to structure fields via base pointers, instead of being resolved as fields in global structs or plain global variables. As an example, V_ifnet becomes: options VIMAGE: ((struct vnet_net ) vnet_net)->_ifnet default build: vnet_net_0._ifnet options VIMAGE_GLOBALS: ifnet 2) INIT_VNET_ macros will declare and set up base pointers to be used by V_ accessor macros, instead of resolving to whitespace: INIT_VNET_NET(ifp->if_vnet); becomes struct vnet_net vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET]; 3) Memory for vnet modules registered via vnet_mod_register() is now allocated at run time in sys/kern/kern_vimage.c, instead of per vnet module structs being declared as globals. If required, vnet modules can now request the framework to provide them with allocated bzeroed memory by filling in the vmi_size field in their vmi_modinfo structures. 4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are extended to hold a pointer to the parent vnet. options VIMAGE builds will fill in those fields as required. 5) curvnet is introduced as a new global variable in options VIMAGE builds, always pointing to the default and only struct vnet. 6) struct sysctl_oid has been extended with additional two fields to store major and minor virtualization module identifiers, oid_v_subs and oid_v_mod. SYSCTL_V_ family of macros will fill in those fields accordingly, and store the offset in the appropriate vnet container struct in oid_arg1. In sysctl handlers dealing with virtualized sysctls, the SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target variable and make it available in arg1 variable for further processing. Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have been deleted. Reviewed by: bz, rwatson Approved by: julian (mentor)
# ca04ba64	05-Feb-2009	Jamie Gritton <jamie@FreeBSD.org>	Don't allow creating a socket with a protocol family that the current jail doesn't support. This involves a new function prison_check_af, like prison_check_ip[46] but that checks only the family. With this change, most of the errors generated by jailed sockets shouldn't ever occur, at least until jails are changeable. Approved by: bz (mentor)
# fd4f1ebd	04-Feb-2009	Robert Watson <rwatson@FreeBSD.org>	Remove written-to but never read local variable 'offset' from soreceive_dgram(). Submitted by: Christoph Mallon <christoph dot mallon at gmx dot de> MFC after: 1 week
# 62938659	10-Dec-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Make sure nmbclusters are initialized before maxsockets by running the tunable_mbinit() SYSINIT at SI_ORDER_MIDDLE before the init_maxsockets() SYSINT at SI_ORDER_ANY. Reviewed by: rwatson, zec Sponsored by: The FreeBSD Foundation MFC after: 4 weeks
# 36b5ba0c	10-Dec-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Style changes only. Put the return type on an extra line[1] and add an empty line at the beginning as we do not have any local variables. Submitted by: rwatson [1] Reviewed by: rwatson MFC after: 4 weeks
# 413628a7	29-Nov-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
# b4cf0e62	21-Nov-2008	Konstantin Belousov <kib@FreeBSD.org>	Add sv_flags field to struct sysentvec with intention to provide description of the ABI of the currently executing image. Change some places to test the flags instead of explicit comparing with address of known sysentvec structures to determine ABI features. Discussed with: dchagin, imp, jhb, peter
# bc97ba51	19-Nov-2008	Julian Elischer <julian@FreeBSD.org>	Fix a scope problem in the multiple routing table code that stopped the SO_SETFIB socket option from working correctly. Obtained from: Ironport MFC after: 3 days
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 5a1760fc	16-Oct-2008	Kip Macy <kmacy@FreeBSD.org>	make sure that SO_NO_DDP and SO_NO_OFFLOAD get passed in correctly PR: 127360 MFC after: 3 days
# ff601c36	07-Oct-2008	Robert Watson <rwatson@FreeBSD.org>	In soreceive_dgram, when a 0-length buffer is passed into recv(2) and no data is ready, return 0 rather than blocking or returning EAGAIN. This is consistent with the behavior of soreceive_generic (soreceive) in earlier versions of FreeBSD, and restores this behavior for UDP. Discussed with: jhb, sam MFC after: 3 days
# ffe72750	07-Oct-2008	Robert Watson <rwatson@FreeBSD.org>	Remove temporary debugging KASSERT's introduced to detect protocols improperly invoking sosend(), soreceive(), and sopoll() instead of attach either specialized or _generic() versions of those functions to their pru_sosend, pru_soreceive, and pru_sopoll protosw methods. MFC after: 3 days
# 1af1c6cd	01-Oct-2008	John Baldwin <jhb@FreeBSD.org>	Wait until after dropping the receive socket buffer lock to allocate space to store the socket address stored in the first mbuf in a packet chain. This reduces contention on the lock and CPU system time in certain UDP workloads. Tested by: ps Reviewed by: rwatson MFC after: 1 week
# 25edc6dd	01-Oct-2008	Robert Watson <rwatson@FreeBSD.org>	Various cleanups for soreceive_dgram(): - Update or remove comments that were left over from the original soreceive_generic() implementation. Quite a few were misleading in the context of the new code. - Since soreceive_dgram() has a simpler structure, replace several gotos with a while loop making the invariants more clear. - In the blocking while loop, don't try to handle cases incompatible with the loop invariant (since m is always NULL, don't check for and handle non-NULL). - Don't drop and re-acquire the socket buffer lock unnecessarily after sbwait() returns, which may help reduce lock contention (etc). - Assume PR_ATOMIC since we assert it at the top of the function. MFC after: 3 days
# c4688866	30-Sep-2008	John Baldwin <jhb@FreeBSD.org>	Update the function name in several assertions in soreceive_dgram(). Approved by: rwatson MFC after: 3 days
# 26ec197d	02-Sep-2008	Robert Watson <rwatson@FreeBSD.org>	Remove XXXRW in soreceive_dgram that proves unnecessary. Remove unused orig_resid variable in soreceive_dgram. Submitted by: alfred X-MFC with: soreceive_dgram (r180198, r180211)
# dd0e6c38	20-Jul-2008	Kip Macy <kmacy@FreeBSD.org>	Add accessor functions for socket fields. MFC after: 1 week
# 6992381e	03-Jul-2008	Robert Watson <rwatson@FreeBSD.org>	Update copyright date in light of soreceive_dgram(9).
# 5df3e839	02-Jul-2008	Robert Watson <rwatson@FreeBSD.org>	Add soreceive_dgram(9), an optimized socket receive function for use by datagram-only protocols, such as UDP. This version removes use of sblock(), which is not required due to an inability to interlace data improperly with datagrams, as well as avoiding some of the larger loops and state management that don't apply on datagram sockets. This is experimental code, so hook it up only for UDPv4 for testing; if there are problems we may need to revise it or turn it off by default, but it offers significant performance improvements for threaded UDP applications such as BIND9, nsd, and memcached using UDP. Tested by: kris, ps
# 8b07e49a	09-May-2008	Julian Elischer <julian@FreeBSD.org>	Add code to allow the system to handle multiple routing tables. This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x) Currently the only protocol that can make use of the multiple tables is IPv4 Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x) , but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as a EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPV4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (One per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extent that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X] when for all protocol families except ipv4 Y is always 0. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the Address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4 forcing the action to row 0 or to the appropriate row if it IS IPv4 (and that info is available). These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us as to how the correct FIB is selected for an outgoing IPV4 packet. Firstly, all packets have a FIB associated with them. if nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility call setfib that acts a bit like nice.. setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0, (or possibly a number settable in a sysctl(not yet)). but prior to routing the firewall can inspect them (see below). (possibly in the future you may be able to associate a FIB with packets received on an interface.. An ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source. (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being reponded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect withthe proces that set up the tunnel. thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition Netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, It can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routes accordingly. ipfw has grown 2 new keywords: setfib N ip from anay to any count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity. I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. it will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. this will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing the data. instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods this reached would be called. The methods would have an argument that gives FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibilty of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the desination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco
# cf71e438	14-Apr-2008	Randall Stewart <rrs@FreeBSD.org>	Add pru_flush routine so a transport can flush itself during Shutdown MFC after: 1 week
# ea26d587	25-Mar-2008	Ruslan Ermilov <ru@FreeBSD.org>	Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT. Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true since the advent of MBUMA. Reviewed by: arch There are ongoing disputes as to whether we want to switch to directly using UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.
# 073d8ba4	19-Mar-2008	Maxim Sobolev <sobomax@FreeBSD.org>	Revert previous change - it appears that the limit I was hitting was a maxsockets limit, not maxfiles limit. The question remains why those limits are handled differently (with error code for maxfiles but with sleep for maxsokets), but those would be addressed in a separate commit if necessary. Requested by: rwhatson, jeff
# c9370ff4	16-Mar-2008	Maxim Sobolev <sobomax@FreeBSD.org>	Properly set size of the file_zone to match kern.maxfiles parameter. Otherwise the parameter is no-op, since zone by default limits number of descriptors to some 12K entries. Attempt to allocate more ends up sleeping on zonelimit. MFC after: 2 weeks
# 3f0bfccc	03-Feb-2008	Robert Watson <rwatson@FreeBSD.org>	Further clean up sorflush: - Expose sbrelease_internal(), a variant of sbrelease() with no expectations about the validity of locks in the socket buffer. - Use sbrelease_internel() in sorflush(), and as a result avoid intializing and destroying a socket buffer lock for the temporary stack copy of the actual buffer, asb. - Add a comment indicating why we do what we do, and remove an XXX since things have gotten less ugly in sorflush() lately. This makes socket close cleaner, and possibly also marginally faster. MFC after: 3 weeks
# 265de5bb	31-Jan-2008	Robert Watson <rwatson@FreeBSD.org>	Correct two problems relating to sorflush(), which is called to flush read socket buffers in shutdown() and close(): - Call socantrcvmore() before sblock() to dislodge any threads that might be sleeping (potentially indefinitely) while holding sblock(), such as a thread blocked in recv(). - Flag the sblock() call as non-interruptible so that a signal delivered to the thread calling sorflush() doesn't cause sblock() to fail. The sblock() is required to ensure that all other socket consumer threads have, in fact, left, and do not enter, the socket buffer until we're done flushin it. To implement the latter, change the 'flags' argument to sblock() to accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK flag. When SBL_NOINTR is set, it forces a non-interruptible sx acquisition, regardless of the setting of the disposition of SB_NOINTR on the socket buffer; without this change it would be possible for another thread to clear SB_NOINTR between when the socket buffer mutex is released and sblock() is invoked. Reviewed by: bz, kmacy Reported by: Jos Backus <jos at catnook dot com>
# 30d239bc	24-Oct-2007	Robert Watson <rwatson@FreeBSD.org>	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updates as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
# 041b706b	04-Jun-2007	David Malone <dwmalone@FreeBSD.org>	Despite several examples in the kernel, the third argument of sysctl_handle_int is not sizeof the int type you want to export. The type must always be an int or an unsigned int. Remove the instances where a sizeof(variable) is passed to stop people accidently cut and pasting these examples. In a few places this was sysctl_handle_int was being used on 64 bit types, which would truncate the value to be exported. In these cases use sysctl_handle_quad to export them and change the format to Q so that sysctl(1) can still print them.
# 1c4bcd05	31-May-2007	Jeff Roberson <jeff@FreeBSD.org>	- Move rusage from being per-process in struct pstats to per-thread in td_ru. This removes the requirement for per-process synchronization in statclock() and mi_switch(). This was previously supported by sched_lock which is going away. All modifications to rusage are now done in the context of the owning thread. reads proceed without locks. - Aggregate exiting threads rusage in thread_exit() such that the exiting thread's rusage is not lost. - Provide a new routine, rufetch() to fetch an aggregate of all rusage structures from all threads in a process. This routine must be used in any place requiring a rusage from a process prior to it's exit. The exited process's rusage is still available via p_ru. - Aggregate tick statistics only on demand via rufetch() or when a thread exits. Tick statistics are kept in the thread and protected by sched_lock until it exits. Initial patch by: attilio Reviewed by: attilio, bde (some objections), arch (mostly silent)
# d19e16a7	16-May-2007	Robert Watson <rwatson@FreeBSD.org>	Generally migrate to ANSI function headers, and remove 'register' use.
# ccd8d954	07-May-2007	Pyun YongHyeon <yongari@FreeBSD.org>	Add missing socket buffer unlock before returning to userland. Reviewed by: rwatson
# 7abab911	03-May-2007	Robert Watson <rwatson@FreeBSD.org>	sblock() implements a sleep lock by interlocking SB_WANT and SB_LOCK flags on each socket buffer with the socket buffer's mutex. This sleep lock is used to serialize I/O on sockets in order to prevent I/O interlacing. This change replaces the custom sleep lock with an sx(9) lock, which results in marginally better performance, better handling of contention during simultaneous socket I/O across multiple threads, and a cleaner separation between the different layers of locking in socket buffers. Specifically, the socket buffer mutex is now solely responsible for serializing simultaneous operation on the socket buffer data structure, and not for I/O serialization. While here, fix two historic bugs: (1) a bug allowing I/O to be occasionally interlaced during long I/O operations (discovere by Isilon). (2) a bug in which failed non-blocking acquisition of the socket buffer I/O serialization lock might be ignored (discovered by sam). SCTP portion of this patch submitted by rrs.
# 8c799760	26-Mar-2007	Robert Watson <rwatson@FreeBSD.org>	Following movement of functions from uipc_socket2.c to uipc_socket.c and uipc_sockbuf.c, clean up and update comments.
# 20d9e5e8	26-Mar-2007	Robert Watson <rwatson@FreeBSD.org>	Complete removal of uipc_socket2.c by moving the last few functions to other C files: - Move sbcreatecontrol() and sbtoxsockbuf() to uipc_sockbuf.c. While sbcreatecontrol() is really an mbuf allocation routine, it does its work with awareness of the layout of socket buffer memory. - Move pru_*() protocol switch stubs to uipc_socket.c where the non-stub versions of several of these functions live. Likewise, move socket state transition calls (soisconnecting(), etc) to uipc_socket.c. Moveo sodupsockaddr() and sotoxsocket().
# cd68a3f7	22-Mar-2007	Gleb Smirnoff <glebius@FreeBSD.org>	Move the dom_dispose and pru_detach calls in sofree() earlier. Only after calling pru_detach we can be absolutely sure, that we don't have any references to the socket in the stack. This closes race between lockless sbdestroy() and data arriving on socket. Reviewed by: rwatson
# 75685034	12-Mar-2007	John Baldwin <jhb@FreeBSD.org>	- Use m_gethdr(), m_get(), and m_clget() instead of the macros in sosend_copyin(). - Use M_WAITOK instead of M_TRYWAIT in sosend_copyin(). - Don't check for NULL from M_WAITOK and return ENOBUFS. M_WAITOK/M_TRYWAIT allocations don't fail with NULL. Reviewed by: andre Requested by: andre (2)
# fac61393	26-Feb-2007	Ruslan Ermilov <ru@FreeBSD.org>	Don't block on the socket zone limit during the socket() call which can easily lock up a system otherwise; instead, return ENOBUFS as documented in a manpage, thus reverting us to the FreeBSD 4.x behavior. Reviewed by: rwatson MFC after: 2 weeks
# f58dd470	15-Feb-2007	Robert Watson <rwatson@FreeBSD.org>	Rename somaxconn_sysctl() to sysctl_somaxconn() so that I will be able to claim that sofoo() functions all accept a socket as their first argument.
# 7dc8d021	02-Feb-2007	Bruce M Simpson <bms@FreeBSD.org>	Diff reduction with RELENG_6, style(9): Remove unnecessary brace; && should be on end of line. No functional changes.
# 6a37f331	01-Feb-2007	Andre Oppermann <andre@FreeBSD.org>	Generic socket buffer auto sizing support, header defines, flag inheritance. MFC after: 1 month
# 7c32173b	22-Jan-2007	Andre Oppermann <andre@FreeBSD.org>	Unbreak writes of 0 bytes. Zero byte writes happen when only ancillary control data but no payload data is passed. Change m_uiotombuf() to return at least one empty mbuf if the requested length was zero. Add comment to sosend_dgram and sosend_generic(). Diagnoses by: jhb Regression test by: rwatson Pointy hat to. andre
# abdeb3b0	08-Jan-2007	Robert Watson <rwatson@FreeBSD.org>	Canonicalize copyrights in some files I hold copyrights on: - Sort by date in license blocks, oldest copyright first. - All rights reserved after all copyrights, not just the first. - Use (c) to be consistent with other entries. MFC after: 3 days
# a86ec338	23-Dec-2006	Bruce M Simpson <bms@FreeBSD.org>	Drop all received data mbufs from a socket's queue if the MT_SONAME mbuf is dropped, to preserve the invariant in the PR_ADDR case. Add a regression test to detect this condition, but do not hook it up to the build for now. PR: kern/38495 Submitted by: James Juran Reviewed by: sam, rwatson Obtained from: NetBSD MFC after: 2 weeks
# 84eab9ad	22-Nov-2006	Mohan Srinivasan <mohans@FreeBSD.org>	Fix a race in soclose() where connections could be queued to the listening socket after the pass that cleans those queues. This results in these connections being orphaned (and leaked). The fix is to clean up the so queues after detaching the socket from the protocol. Thanks to ups and jhb for discussions and a thorough code review.
# 1ae4d97d	02-Nov-2006	Andre Oppermann <andre@FreeBSD.org>	Use the improved m_uiotombuf() function instead of home grown sosend_copyin() to do the userland to kernel copying in sosend_generic() and sosend_dgram(). sosend_copyin() is retained for ZERO_COPY_SOCKETS which are not yet supported by m_uiotombuf(). Benchmaring shows significant improvements (95% confidence): 66% less cpu (or 2.9 times better) with new sosend vs. old sosend (non-TSO) 65% less cpu (or 2.8 times better) with new sosend vs. old sosend (TSO) (Sender AMD Opteron 852 (2.6GHz) with em(4) PCI-X-133 interface and receiver DELL Poweredge SC1425 P-IV Xeon 3.2GHz with em(4) LOM connected back to back at 1000Base-TX full duplex.) Sponsored by: TCP/IP Optimization Fundraise 2005 MFC after: 3 month
# aed55708	22-Oct-2006	Robert Watson <rwatson@FreeBSD.org>	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
# 4a75dc25	22-Sep-2006	Bruce M Simpson <bms@FreeBSD.org>	Fix a case where socket I/O atomicity is violated due to not dropping the entire record when a non-data mbuf is removed in the soreceive() path. This only triggers a panic directly when compiled with INVARIANTS. PR: 38495 Submitted by: James Juran MFC after: 1 week
# 689f94bf	13-Sep-2006	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Fix a lock leak in an error case. Reported by: netchild Reviewed by: rwatson
# 805def2e	10-Sep-2006	Andre Oppermann <andre@FreeBSD.org>	New sockets created by incoming connections into listen sockets should inherit all settings and options except listen specific options. Add the missing send/receive timeouts and low watermarks. Remove inheritance of the field so_timeo which is unused. Noticed by: phk Reviewed by: rwatson Sponsored by: TCP/IP Optimization Fundraise 2005 MFC after: 3 days
# daa5817e	18-Aug-2006	George V. Neville-Neil <gnn@FreeBSD.org>	Fix a kernel panic based on receiving an ICMPv6 Packet too Big message. PR: 99779 Submitted by: Jinmei Tatuya Reviewed by: clement, rwatson MFC after: 1 week
# 79ad81c0	11-Aug-2006	Robert Watson <rwatson@FreeBSD.org>	Before performing a sodealloc() when pru_attach() fails, assert that the socket refcount remains 1, and then drop to 0 before freeing the socket. PR: 101763 Reported by: Gleb Kozyrev <gkozyrev at ukr dot net>
# 9126410f	02-Aug-2006	Robert Watson <rwatson@FreeBSD.org>	Move destroying kqueue state from above pru_detach to below it in sofree(), as a number of protocols expect to be able to call soisdisconnected() during detach. That may not be a good assumption, but until I'm sure if it's a good assumption or not, allow it.
# c0e1415d	01-Aug-2006	Robert Watson <rwatson@FreeBSD.org>	Move updated of 'numopensockets' from bottom of sodealloc() to the top, eliminating a second set of identical mutex operations at the bottom. This allows brief exceeding of the max sockets limit, but only by sockets in the last stages of being torn down.
# eaa6dfbc	01-Aug-2006	Robert Watson <rwatson@FreeBSD.org>	Reimplement socket buffer tear-down in sofree(): as the socket is no longer referenced by other threads (hence our freeing it), we don't need to set the can't send and can't receive flags, wake up the consumers, perform two levels of locking, etc. Implement a fast-path teardown, sbdestroy(), which flushes and releases each socket buffer. A manual dom_dispose of the receive buffer is still required explicitly to GC any in-flight file descriptors, etc, before flushing the buffer. This results in a 9% UP performance improvement and 16% SMP performance improvement on a tight loop of socket();close(); in micro-benchmarking, but will likely also affect CPU-bound macro-benchmark performance.
# b0668f71	24-Jul-2006	Robert Watson <rwatson@FreeBSD.org>	soreceive_generic(), and sopoll_generic(). Add new functions sosend(), soreceive(), and sopoll(), which are wrappers for pru_sosend, pru_soreceive, and pru_sopoll, and are now used univerally by socket consumers rather than either directly invoking the old so*() functions or directly invoking the protocol switch method (about an even split prior to this commit). This completes an architectural change that was begun in 1996 to permit protocols to provide substitute implementations, as now used by UDP. Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to perform these operations on sockets -- in particular, distributed file systems and socket system calls. Architectural head nod: sam, gnn, wollman
# 809c2b78	23-Jul-2006	Robert Watson <rwatson@FreeBSD.org>	Update various uipc_socket.c comments, and reformat others.
# a152f8a3	21-Jul-2006	Robert Watson <rwatson@FreeBSD.org>	Change semantics of socket close and detach. Add a new protocol switch function, pru_close, to notify protocols that the file descriptor or other consumer of a socket is closing the socket. pru_abort is now a notification of close also, and no longer detaches. pru_detach is no longer used to notify of close, and will be called during socket tear-down by sofree() when all references to a socket evaporate after an earlier call to abort or close the socket. This means detach is now an unconditional teardown of a socket, whereas previously sockets could persist after detach of the protocol retained a reference. This faciliates sharing mutexes between layers of the network stack as the mutex is required during the checking and removal of references at the head of sofree(). With this change, pru_detach can now assume that the mutex will no longer be required by the socket layer after completion, whereas before this was not necessarily true. Reviewed by: gnn
# 5cd1a271	16-Jul-2006	Robert Watson <rwatson@FreeBSD.org>	Change comment on soabort() to more accurately describe how/when soabort() is used. Remove trailing white space.
# 5908c617	11-Jul-2006	Robert Watson <rwatson@FreeBSD.org>	Several protocol switch functions (pru_abort, pru_detach, pru_sosetlabel) return void, so don't implement no-op versions of these functions. Instead, consistently check if those switch pointers are NULL before invoking them.
# f949ae9b	11-Jul-2006	Robert Watson <rwatson@FreeBSD.org>	When pru_attach() fails, call sodealloc() on the socket rather than using sorele() and the full tear-down path. Since protocol state allocation failed, this is not required (and is arguably undesirable). This matches the behavior of sonewconn() under the same circumstances.
# 721150ad	18-Jun-2006	Robert Watson <rwatson@FreeBSD.org>	When retrieving SO_ERROR via getsockopt(), hold the socket lock around the retrieval and replacement with 0. MFC after: 1 week
# b37ffd31	10-Jun-2006	Robert Watson <rwatson@FreeBSD.org>	Move some functions and definitions from uipc_socket2.c to uipc_socket.c: - Move sonewconn(), which creates new sockets for incoming connections on listen sockets, so that all socket allocate code is together in uipc_socket.c. - Move 'maxsockets' and associated sysctls to uipc_socket.c with the socket allocation code. - Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it to sysctl.h and remove lots of scattered implementations in various IPC modules. - Sort sodealloc() after soalloc() in uipc_socket.c for dependency order reasons. Statisticize soalloc() and sodealloc() as they are now required only in uipc_socket.c, and are internal to the socket implementation. After this change, socket allocation and deallocation is entirely centralized in one file, and uipc_socket2.c consists entirely of socket buffer manipulation and default protocol switch functions. MFC after: 1 month
# e02421f3	08-Jun-2006	Robert Watson <rwatson@FreeBSD.org>	Rearrange code in soalloc() so that it's less indented by returning early if uma_zalloc() from the socket zone fails. No functional change. MFC after: 1 week
# 0cec9959	23-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Assert that sockets passed into soabort() not be SQ_COMP or SQ_INCOMP, since that removal should have been done a layer up. MFC after: 3 months
# 28ea1801	23-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Add missing 'not' to SQ_COMP comment. MFC after: 3 months
# 6ca35d4b	23-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Move handling of SQ_COMP exception case in sofree() to the top of the function along with the remainder of the reference checking code. Move comment from body to header with remainder of comments. Inclusion of a socket in a completed connection queue counts as a true reference, and should not be handled as an under-documented edge case. MFC after: 3 months
# bc725eaf	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Chance protocol switch method pru_detach() so that it returns void rather than an error. Detaches do not "fail", they other occur or the protocol flags SS_PROTOREF to take ownership of the socket. soclose() no longer looks at so_pcb to see if it's NULL, relying entirely on the protocol to decide whether it's time to free the socket or not using SS_PROTOREF. so_pcb is now entirely owned and managed by the protocol code. Likewise, no longer test so_pcb in other socket functions, such as soreceive(), which have no business digging into protocol internals. Protocol detach routines no longer try to free the socket on detach, this is performed in the socket code if the protocol permits it. In rts_detach(), no longer test for rp != NULL in detach, and likewise in other protocols that don't permit a NULL so_pcb, reduce the incidence of testing for it during detach. netinet and netinet6 are not fully updated to this change, which will be in an upcoming commit. In their current state they may leak memory or panic. MFC after: 3 months
# ac45e92f	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Change protocol switch pru_abort() API so that it returns void rather than an int, as an error here is not meaningful. Modify soabort() to unconditionally free the socket on the return of pru_abort(), and modify most protocols to no longer conditionally free the socket, since the caller will do this. This commit likely leaves parts of netinet and netinet6 in a situation where they may panic or leak memory, as they have not are not fully updated by this commit. This will be corrected shortly in followup commits to these components. MFC after: 3 months
# 7f689de2	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Assert so->so_pcb is NULL in sodealloc() -- the protocol state should not be present at this point. We will eventually remove this assert because the socket layer should never look at so_pcb, but for now it's a useful debugging tool. MFC after: 3 months
# 220c1357	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Add a somewhat sizable comment documenting the semantics of various kernel socket calls relating to the creation and destruction of sockets. This will eventually form the foundation of socket(9), but is currently in too much flux to do so. MFC after: 3 months
# 92c07a34	16-Mar-2006	Robert Watson <rwatson@FreeBSD.org>	Change soabort() from returning int to returning void, since all consumers ignore the return value, soabort() is required to succeed, and protocols produce errors here to report multiple freeing of the pcb, which we hope to eliminate.
# 93709ad0	14-Mar-2006	Robert Watson <rwatson@FreeBSD.org>	As with socket consumer references (so_count), make sofree() return without GC'ing the socket if a strong protocol reference to the socket is present (SS_PROTOREF).
# 13f322c2	12-Feb-2006	Robert Watson <rwatson@FreeBSD.org>	Improve consistency of return() style. MFC after: 3 days
# b8ae1cd6	13-Jan-2006	Robert Watson <rwatson@FreeBSD.org>	Add sosend_dgram(), a greatly reduced and simplified version of sosend() intended for use solely with atomic datagram socket types, and relies on the previous break-out of sosend_copyin(). Changes to allow UDP to optionally use this instead of sosend() will be committed as a follow-up.
# 398293a8	29-Nov-2005	John Baldwin <jhb@FreeBSD.org>	Fix snderr() to not leak the socket buffer lock if an error occurs in sosend(). Robert accidentally changed the snderr() macro to jump to the out label which assumes the lock is already released rather than the release label which drops the lock in his previous change to sosend(). This should fix the recent panics about returning from write(2) with the socket lock held and the most recent LOR on current@.
# 66dd8a6f	28-Nov-2005	Robert Watson <rwatson@FreeBSD.org>	Move zero copy statistics structure before sosend_copyin(). MFC after: 1 month Reported by: tinderbox, sam
# a725629c	28-Nov-2005	Robert Watson <rwatson@FreeBSD.org>	Break out functionality in sosend() responsible for building mbuf chains and copying in mbufs from the body of the send logic, creating a new function sosend_copyin(). This changes makes sosend() almost readable, and will allow the same logic to be used by tailored socket send routines. MFC after: 1 month Reviewed by: andre, glebius
# 34333b16	02-Nov-2005	Andre Oppermann <andre@FreeBSD.org>	Retire MT_HEADER mbuf type and change its users to use MT_DATA. Having an additional MT_HEADER mbuf type is superfluous and redundant as nothing depends on it. It only adds a layer of confusion. The distinction between header mbuf's and data mbuf's is solely done through the m->m_flags M_PKTHDR flag. Non-native code is not changed in this commit. For compatibility MT_HEADER is mapped to MT_DATA. Sponsored by: TCP/IP Optimization Fundraise 2005
# d374e81e	30-Oct-2005	Robert Watson <rwatson@FreeBSD.org>	Push the assignment of a new or updated so_qlimit from solisten() following the protocol pru_listen() call to solisten_proto(), so that it occurs under the socket lock acquisition that also sets SO_ACCEPTCONN. This requires passing the new backlog parameter to the protocol, which also allows the protocol to be aware of changes in queue limit should it wish to do something about the new queue limit. This continues a move towards the socket layer acting as a library for the protocol. Bump __FreeBSD_version due to a change in the in-kernel protocol interface. This change has been tested with IPv4 and UNIX domain sockets, but not other protocols.
# 53f5742d	26-Oct-2005	Paul Saab <ps@FreeBSD.org>	Allow 32bit get/setsockopt with SO_SNDTIMEO or SO_RECVTIMEO to work.
# 8434c29b	18-Sep-2005	Robert Watson <rwatson@FreeBSD.org>	Add three new read-only socket options, which allow regression tests and other applications to query the state of the stack regarding the accept queue on a listen socket: SO_LISTENQLIMIT Return the value of so_qlimit (socket backlog) SO_LISTENQLEN Return the value of so_qlen (complete sockets) SO_LISTENINCQLEN Return the value of so_incqlen (incomplete sockets) Minor white space tweaks to existing socket options to make them consistent. Discussed with: andre MFC after: 1 week
# bc6b8b5d	18-Sep-2005	Robert Watson <rwatson@FreeBSD.org>	Fix spelling in a comment. MFC after: 3 days
# aada5ccc	15-Sep-2005	Maxim Konovalov <maxim@FreeBSD.org>	Backout rev. 1.246, it breaks code uses shutdown(2) on non-connected sockets. Pointed out by: rwatson
# c5cff170	15-Sep-2005	Maxim Konovalov <maxim@FreeBSD.org>	o Return ENOTCONN when shutdown(2) on non-connected socket. PR: kern/84761 Submitted by: James Juran R-test: tools/regression/sockets/shutdown MFC after: 1 month
# 016e6212	06-Sep-2005	Gleb Smirnoff <glebius@FreeBSD.org>	In soreceive(), when a first mbuf is removed from socket buffer use sockbuf_pushsync(). Previous manipulation could lead to an inconsistent mbuf. Reviewed by: rwatson
# dcb5fef5	01-Aug-2005	Kelly Yancey <kbyanc@FreeBSD.org>	Make getsockopt(..., SOL_SOCKET, SO_ACCEPTCONN, ...) work per IEEE Std 1003.1 (POSIX).
# 0d52d7b0	28-Jul-2005	George V. Neville-Neil <gnn@FreeBSD.org>	Fix for PR 83885. Make sure that there actually is a next packet before setting nextrecord to that field. PR: 83885 Submitted by: hirose@comm.yamaha.co.jp Obtained from: Patch suggested in the PR MFC after: 1 week
# 571dcd15	01-Jul-2005	Suleiman Souhlal <ssouhlal@FreeBSD.org>	Fix the recent panics/LORs/hangs created by my kqueue commit by: - Introducing the possibility of using locks different than mutexes for the knlist locking. In order to do this, we add three arguments to knlist_init() to specify the functions to use to lock, unlock and check if the lock is owned. If these arguments are NULL, we assume mtx_lock, mtx_unlock and mtx_owned, respectively. - Using the vnode lock for the knlist locking, when doing kqueue operations on a vnode. This way, we don't have to lock the vnode while holding a mutex, in filt_vfsread. Reviewed by: jmg Approved by: re (scottl), scottl (mentor override) Pointyhat to: ssouhlal Will be happy: everyone
# fc74a9f9	10-Jun-2005	Brooks Davis <brooks@FreeBSD.org>	Stop embedding struct ifnet at the top of driver softcs. Instead the struct ifnet or the layer 2 common structure it was embedded in have been replaced with a struct ifnet pointer to be filled by a call to the new function, if_alloc(). The layer 2 common structure is also allocated via if_alloc() based on the interface type. It is hung off the new struct ifnet member, if_l2com. This change removes the size of these structures from the kernel ABI and will allow us to better manage them as interfaces come and go. Other changes of note: - Struct arpcom is no longer referenced in normal interface code. Instead the Ethernet address is accessed via the IFP2ENADDR() macro. To enforce this ac_enaddr has been renamed to _ac_enaddr. - The second argument to ether_ifattach is now always the mac address from driver private storage rather than sometimes being ac_enaddr. Reviewed by: sobomax, sam
# 8bde9359	09-Jun-2005	Scott Long <scottl@FreeBSD.org>	Drat! Committed from the wrong branch. Restore HEAD to its previous goodness.
# 76b472db	09-Jun-2005	Scott Long <scottl@FreeBSD.org>	Back out 1.68.2.26. It was a mis-guided change that was already backed out of HEAD and should not have been MFC'd. This will restore UDP socket functionality, which will correct the recent NFS problems. Submitted by: rwatson
# 92dd256b	05-Jun-2005	Andrew Gallatin <gallatin@FreeBSD.org>	Allow sends sent from non page-aligned userspace addresses to be considered for zero-copy sends. Reviewed by: alc Submitted by: Romer Gil at Rice University
# a59f81d2	11-Mar-2005	Robert Watson <rwatson@FreeBSD.org>	Move the logic implementing retrieval of the SO_ACCEPTFILTER socket option from uipc_socket.c to uipc_accf.c in do_getopt_accept_filter(), so that it now matches do_setopt_accept_filter(). Slightly reformulate the logic to match the optimistic allocation of storage for the argument in advance, and slightly expand the coverage of the socket lock.
# 56856fbf	11-Mar-2005	Robert Watson <rwatson@FreeBSD.org>	Remove an additional commented out reference to a possible future sx lock.
# 2b37548a	11-Mar-2005	Robert Watson <rwatson@FreeBSD.org>	When setting up a socket in socreate(), there's no need to lock the socket lock around knlist_init(), so don't. Hard code the setting of the socket reference count to 1 rather than using soref() to avoid asserting the socket lock, since we've not yet exposed the socket to other threads. This removes two mutex operations from each socket allocation.
# 5fab68b1	11-Mar-2005	Robert Watson <rwatson@FreeBSD.org>	Remove suggestive sx_init() comment in soalloc(). We will have something like this at some point, but for now it clutters the source.
# 0daccb9c	21-Feb-2005	Robert Watson <rwatson@FreeBSD.org>	In the current world order, solisten() implements the state transition of a socket from a regular socket to a listening socket able to accept new connections. As part of this state transition, solisten() calls into the protocol to update protocol-layer state. There were several bugs in this implementation that could result in a race wherein a TCP SYN received in the interval between the protocol state transition and the shortly following socket layer transition would result in a panic in the TCP code, as the socket would be in the TCPS_LISTEN state, but the socket would not have the SO_ACCEPTCONN flag set. This change does the following: - Pushes the socket state transition from the socket layer solisten() to to socket "library" routines called from the protocol. This permits the socket routines to be called while holding the protocol mutexes, preventing a race exposing the incomplete socket state transition to TCP after the TCP state transition has completed. The check for a socket layer state transition is performed by solisten_proto_check(), and the actual transition is performed by solisten_proto(). - Holds the socket lock for the duration of the socket state test and set, and over the protocol layer state transition, which is now possible as the socket lock is acquired by the protocol layer, rather than vice versa. This prevents additional state related races in the socket layer. This permits the dual transition of socket layer and protocol layer state to occur while holding locks for both layers, making the two changes atomic with respect to one another. Similar changes are likely require elsewhere in the socket/protocol code. Reported by: Peter Holm <peter@holm.cc> Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net> Philosophical head nod: gnn
# a00428ef	20-Feb-2005	Robert Watson <rwatson@FreeBSD.org>	In soreceive(), when considering delivery to a socket in SS_ISCONFIRMING, only call the protocol's pru_rcvd() if the protocol has the flag PR_WANTRCVD set. This brings that instance of pru_rcvd() into line with the rest, which do check the flag. MFC after: 3 days
# a7ae36bc	18-Feb-2005	Robert Watson <rwatson@FreeBSD.org>	Correct a typo in the comment describing soreceive_rcvoob(). MFC after: 3 days
# 1b5c4b15	18-Feb-2005	Robert Watson <rwatson@FreeBSD.org>	In soconnect(), when resetting so->so_error, the socket lock is not required due to a straight integer write in which minor races are not a problem.
# 78e43644	18-Feb-2005	Robert Watson <rwatson@FreeBSD.org>	Move do_setopt_accept_filter() from uipc_socket.c to uipc_accf.c, where the rest of the accept filter code currently lives. MFC after: 3 days
# 627de7fa	18-Feb-2005	Robert Watson <rwatson@FreeBSD.org>	Re-order checks in socheckuid() so that we check all deny cases before returning accept. MFC after: 3 days
# 0d89301c	17-Feb-2005	Robert Watson <rwatson@FreeBSD.org>	In solisten(), unconditionally set the SO_ACCEPTCONN option in so->so_options when solisten() will succeed, rather than setting it conditionally based on there not being queued sockets in the completed socket queue. Otherwise, if the protocol exposes new sockets via the completed queue before solisten() completes, the listen() system call will succeed, but the socket and protocol state will be out of sync. For TCP, this didn't happen in practice, as the TCP code will panic if a new connection comes in after the tcpcb has been transitioned to a listening state but the socket doesn't have SO_ACCEPTCONN set. This is historical behavior resulting from bitrot since 4.3BSD, in which that line of code was associated with the conditional NULL'ing of the connection queue pointers (one-time initialization to be performed during the transition to a listening socket), which are now initialized separately. Discussed with: fenner, gnn MFC after: 3 days
# 90d52f2f	23-Jan-2005	Gleb Smirnoff <glebius@FreeBSD.org>	- Convert so_qlen, so_incqlen, so_qlimit fields of struct socket from short to unsigned short. - Add SYSCTL_PROC() around somaxconn, not accepting values < 1 or > U_SHRTMAX. Before this change setting somaxconn to smth above 32767 and calling listen(fd, -1) lead to a socket, which doesn't accept connections at all. Reviewed by: rwatson Reported by: Igor Sysoev
# fdf84ec4	12-Jan-2005	Maxim Sobolev <sobomax@FreeBSD.org>	When re-connecting already connected datagram socket ensure to clean up its pending error state, which may be set in some rare conditions resulting in connect() syscall returning that bogus error and making application believe that attempt to change association has failed, while it has not in fact. There is sockets/reconnect regression test which excersises this bug. MFC after: 2 weeks
# 9454b2d8	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for copyright notices, minor format tweaks as necessary
# ba653911	22-Dec-2004	Robert Watson <rwatson@FreeBSD.org>	Remove an XXXRW indicating atomic operations might be used as a substitute for a global mutex protecting the socket count and generation number. The observation that soreceive_rcvoob() can't return an mbuf chain is a property, not a bug, so remove the XXXRW. In sorflush, s/existing/previous/ for code when describing prior behavior. For SO_LINGER socket option retrieval, remove an XXXRW about why we hold the mutex: this is correct and not dubious. MFC after: 2 weeks
# 81b5dbec	22-Dec-2004	Robert Watson <rwatson@FreeBSD.org>	In soalloc(), simplify the mac_init_socket() handling to remove unnecessary use of a global variable and simplify the return case. While here, use ()'s around return values. In sodealloc(), remove a comment about why we bump the gencnt and decrement the socket count separately. It doesn't add substantially to the reading, and clutters the function. MFC after: 2 weeks
# c73e3e92	09-Dec-2004	Alan Cox <alc@FreeBSD.org>	Remove unneeded code from the zero-copy receive path. Discussed with: gallatin@ Tested by: ken@
# 1c4dbeda	07-Dec-2004	Alan Cox <alc@FreeBSD.org>	Tidy up the zero-copy receive path: Remove an unneeded argument to uiomoveco() and userspaceco().
# d297f702	29-Nov-2004	Paul Saab <ps@FreeBSD.org>	If soreceive() is called from a socket callback, there's no reason to do a window update to the peer (thru an ACK) from soreceive() itself. TCP will do that upon return from the socket callback. Sending a window update from soreceive() results in a lock reversal. Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com Reviewed by: rwatson
# 85d11adf	29-Nov-2004	Paul Saab <ps@FreeBSD.org>	Make soreceive(MSG_DONTWAIT) nonblocking. If MSG_DONTWAIT is passed into soreceive(), then pass in M_DONTWAIT to m_copym(). Also fix up error handling for the case where m_copym() returns failure. Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com Reviewed by: rwatson
# 1449a2f5	09-Nov-2004	Gleb Smirnoff <glebius@FreeBSD.org>	Since sb_timeo type was increased to int, use INT_MAX instead of SHRT_MAX. This also gives us ability to close PR. PR: kern/42352 Approved by: julian (mentor) MFC after: 1 week
# aae2782b	02-Nov-2004	Robert Watson <rwatson@FreeBSD.org>	Acquire the accept mutex in soabort() before calling sotryfree(), as that is now required. RELENG_5_3 candidate. Foot provided by: Dikshie <dikshie at ppk dot itb dot ac dot id>
# 3a82a545	23-Oct-2004	Andre Oppermann <andre@FreeBSD.org>	socreate() does an early abort if either the protocol cannot be found, or pru_attach is NULL. With loadable protocols the SPACER dummy protocols have valid function pointers for all methods to functions returning just EOPNOTSUPP. Thus the early abort check would not detect immediately that attach is not supported for this protocol. Instead it would correctly get the EOPNOTSUPP error later on when it calls the protocol specific attach function. Add testing against the pru_attach_notsupp() function pointer to the early abort check as well.
# 81158452	18-Oct-2004	Robert Watson <rwatson@FreeBSD.org>	Push acquisition of the accept mutex out of sofree() into the caller (sorele()/sotryfree()): - This permits the caller to acquire the accept mutex before the socket mutex, avoiding sofree() having to drop the socket mutex and re-order, which could lead to races permitting more than one thread to enter sofree() after a socket is ready to be free'd. - This also covers clearing of the so_pcb weak socket reference from the protocol to the socket, preventing races in clearing and evaluation of the reference such that sofree() might be called more than once on the same socket. This appears to close a race I was able to easily trigger by repeatedly opening and resetting TCP connections to a host, in which the tcp_close() code called as a result of the RST raced with the close() of the accepted socket in the user process resulting in simultaneous attempts to de-allocate the same socket. The new locking increases the overhead for operations that may potentially free the socket, so we will want to revise the synchronization strategy here as we normalize the reference counting model for sockets. The use of the accept mutex in freeing of sockets that are not listen sockets is primarily motivated by the potential need to remove the socket from the incomplete connection queue on its parent (listen) socket, so cleaning up the reference model here may allow us to substantially weaken the synchronization requirements. RELENG_5_3 candidate. MFC after: 3 days Reviewed by: dwhite Discussed with: gnn, dwhite, green Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de> Reported by: Vlad <marchenko at gmail dot com>
# 35b260cd	11-Oct-2004	Robert Watson <rwatson@FreeBSD.org>	Rework sofree() logic to take into account a possible race with accept(). Sockets in the listen queues have reference counts of 0, so if the protocol decides to disconnect the pcb and try to free the socket, this triggered a race with accept() wherein accept() would bump the reference count before sofree() had removed the socket from the listen queues, resulting in a panic in sofree() when it discovered it was freeing a referenced socket. This might happen if a RST came in prior to accept() on a TCP connection. The fix is two-fold: to expand the coverage of the accept mutex earlier in sofree() to prevent accept() from grabbing the socket after the "is it really safe to free" tests, and to expand the logic of the "is it really safe to free" tests to check that the refcount is still 0 (i.e., we didn't race). RELENG_5 candidate. Much discussion with and work by: green Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de> Reported by: Vlad <marchenko at gmail dot com>
# 76f69398	05-Sep-2004	Robert Watson <rwatson@FreeBSD.org>	Expand the scope of the socket buffer locks in sopoll() to include the state test as well as set, or we risk a race between a socket wakeup and registering for select() or poll() on the socket. This does increase the cost of the poll operation, but can probably be optimized some in the future. This appears to correct poll() "wedges" experienced with X11 on SMP systems with highly interactive applications, and might affect a plethora of other select() driven applications. RELENG_5 candidate. Problem reported by: Maxim Maximov <mcsi at mcsi dot pp dot ru> Debugged with help of: dwhite
# fe0f2d4e	23-Aug-2004	Robert Watson <rwatson@FreeBSD.org>	Conditional acquisition of socket buffer mutexes when testing socket buffers with kqueue filters is no longer required: the kqueue framework will guarantee that the mutex is held on entering the filter, either due to a call from the socket code already holding the mutex, or by explicitly acquiring it. This removes the last of the conditional socket locking.
# 7b38f0d3	20-Aug-2004	Robert Watson <rwatson@FreeBSD.org>	Back out uipc_socket.c:1.208, as it incorrectly assumes that all sockets are connection-oriented for the purposes of kqueue registration. Since UDP sockets aren't connection-oriented, this appeared to break a great many things, such as RPC-based applications and services (i.e., NFS). Since jmg isn't around I'm backing this out before too many more feet are shot, but intend to investigate the right solution with him once he's available. Apologies to: jmg Discussed with: imp, scottl
# 5d6dd468	19-Aug-2004	John-Mark Gurney <jmg@FreeBSD.org>	make sure that the socket is either accepting connections or is connected when attaching a knote to it... otherwise return EINVAL... Pointed out by: benno
# ad3b9257	15-Aug-2004	John-Mark Gurney <jmg@FreeBSD.org>	Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowlege of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using it's filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to aquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)
# 217a4b6e	10-Aug-2004	Robert Watson <rwatson@FreeBSD.org>	Replace a reference to splnet() with a reference to locking in a comment.
# 99901d0a	25-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	Do some initial locking on accept filter registration and attach. While here, close some races that existed in the pre-locking world during low memory conditions. This locking isn't perfect, but it's closer than before.
# cdb71f75	18-Jul-2004	David Malone <dwmalone@FreeBSD.org>	The recent changes to control message passing broke some things that get certain types of control messages (ping6 and rtsol are examples). This gets the new code closer to working: 1) Collect control mbufs for processing in the controlp == NULL case, so that they can be freed by externalize. 2) Loop over the list of control mbufs, as the externalize function may not know how to deal with chains. 3) In the case where there is no externalize function, remember to add the control mbuf to the controlp list so that it will be returned. 4) After adding stuff to the controlp list, walk to the end of the list of stuff that was added, incase we added a chain. This code can be further improved, but this is enough to get most things working again. Reviewed by: rwatson
# dad7b41a	15-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	When entering soclose(), assert that SS_NOFDREF is not already set.
# dcee93dc	12-Jul-2004	David Malone <dwmalone@FreeBSD.org>	Rename Alfred's kern_setsockopt to so_setsockopt, as this seems a a better name. I have a kern_[sg]etsockopt which I plan to commit shortly, but the arguments to these function will be quite different from so_setsockopt. Approved by: alfred
# d58d3648	12-Jul-2004	Alfred Perlstein <alfred@FreeBSD.org>	Use SO_REUSEADDR and SO_REUSEPORT when reconnecting NFS mounts. Tune the timeout from 5 seconds to 12 seconds. Provide a sysctl to show how many reconnects the NFS client has done. Seems to fix IPv6 from: kuriyama
# a294c366	11-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	Use sockbuf_pushsync() to synchronize stack and socket buffer state in soreceive() after removing an MT_SONAME mbuf from the head of the socket buffer. When processing MT_CONTROL mbufs in soreceive(), first remove all of the MT_CONTROL mbufs from the head of the socket buffer to a local mbuf chain, then feed them into dom_externalize() as a set, which both avoids thrashing the socket buffer lock when handling multiple control mbufs, and also avoids races with other threads acting on the socket buffer when the socket buffer mutex is released to enter the externalize code. Existing races that might occur if the protocol externalize method blocked during processing have also been closed. Now that we synchronize socket buffer and stack state following modifications to the socket buffer, turn the manual synchronization that previously followed control mbuf processing with a set of assertions. This can eventually be removed. The soreceive() code is now substantially more MPSAFE.
# b7562e17	11-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	Add sockbuf_pushsync(), an inline function that, following a change to the head of the mbuf chains in a socket buffer, re-synchronizes the cache pointers used to optimize socket buffer appends. This will be used by soreceive() before dropping socket buffer mutexes to make sure a consistent version of the socket buffer is visible to other threads. While here, update copyright to account for substantial rewrite of much socket code required for fine-grained locking.
# d861372b	11-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	Add additional annotations to soreceive(), documenting the effects of locking on 'nextrecord' and concerns regarding potentially inconsistent or stale use of socket buffer or stack fields if they aren't carefully synchronized whenever the socket buffer mutex is released. Document that the high-level sblock() prevents races against other readers on the socket. Also document the 'type' logic as to how soreceive() guarantees that it will only return one of normal data or inline out-of-band data.
# 0014b343	10-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	In the 'dontblock' section of soreceive(), assert that the mbuf on hand ('m') is in fact the first mbuf in the receive socket buffer.
# 5e44d93f	10-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	Break out non-inline out-of-band data receive code from soreceive() and put it in its own helper function soreceive_rcvoob().
# a04b0939	10-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	Assign pointers values of NULL rather than 0 in soreceive().
# 7e17bc9f	10-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	When the MT_SONAME mbuf is popped off of a receive socket buffer associated with a PR_ADDR protocol, make sure to update the m_nextpkt pointer of the new head mbuf on the chain to point to the next record. Otherwise, when we release the socket buffer mutex, the socket buffer mbuf chain may be in an inconsistent state.
# 5c2b7a22	09-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	Now socket buffer locks are being asserted at higher code blocks in soreceive(), remove some leaf assertions that are redundant.
# 32775a01	09-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	Assert socket buffer lock at strategic points between sections of code in soreceive() to confirm we've moved from block to block properly maintaining locking invariants.
# 6a72b225	05-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	Drop the socket buffer lock around a call to m_copym() with M_TRYWAIT. A subset of locking changes to soreceive() in the queue for merging. Bumped into by: Willem Jan Withagen <wjw@withagen.nl>
# a2905746	26-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Add a new global mutex, so_global_mtx, which protects the global variables so_gencnt, numopensockets, and the per-socket field so_gencnt. Annotate this this might be better done with atomic operations. Annotate what accept_mtx protects.
# 11c40a39	26-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Replace comment on spl state when calling soabort() with a comment on locking state. No socket locks should be held when calling soabort() as it will call into protocol code that may acquire socket locks.
# c6b93bf2	23-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Lock socket buffers when processing setting socket options SO_SNDLOWAT or SO_RCVLOWAT for read-modify-write.
# adb4cf0f	23-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Slide socket buffer lock earlier in sopoll() to cover the call into selrecord(), setting up select and flagging the socker buffers as SB_SEL and setting up select under the lock.
# fea24c0a	21-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Remove spl's from uipc_socket to ease in merging.
# a34b7046	20-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Merge next step in socket buffer locking: - sowakeup() now asserts the socket buffer lock on entry. Move the call to KNOTE higher in sowakeup() so that it is made with the socket buffer lock held for consistency with other calls. Release the socket buffer lock prior to calling into pgsigio(), so_upcall(), or aio_swake(). Locking for this event management will need revisiting in the future, but this model avoids lock order reversals when upcalls into other subsystems result in socket/socket buffer operations. Assert that the socket buffer lock is not held at the end of the function. - Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now have _locked versions which assert the socket buffer lock on entry. If a wakeup is required by sb_notify(), invoke sowakeup(); otherwise, unconditionally release the socket buffer lock. This results in the socket buffer lock being released whether a wakeup is required or not. - Break out socantsendmore() into socantsendmore_locked() that asserts the socket buffer lock. socantsendmore() unconditionally locks the socket buffer before calling socantsendmore_locked(). Note that both functions return with the socket buffer unlocked as socantsendmore_locked() calls sowwakeup_locked() which has the same properties. Assert that the socket buffer is unlocked on return. - Break out socantrcvmore() into socantrcvmore_locked() that asserts the socket buffer lock. socantrcvmore() unconditionally locks the socket buffer before calling socantrcvmore_locked(). Note that both functions return with the socket buffer unlocked as socantrcvmore_locked() calls sorwakeup_locked() which has similar properties. Assert that the socket buffer is unlocked on return. - Break out sbrelease() into a sbrelease_locked() that asserts the socket buffer lock. sbrelease() unconditionally locks the socket buffer before calling sbrelease_locked(). sbrelease_locked() now invokes sbflush_locked() instead of sbflush(). - Assert the socket buffer lock in socket buffer sanity check functions sblastrecordchk(), sblastmbufchk(). - Assert the socket buffer lock in SBLINKRECORD(). - Break out various sbappend() functions into sbappend_locked() (and variations on that name) that assert the socket buffer lock. The !_locked() variations unconditionally lock the socket buffer before calling their _locked counterparts. Internally, make sure to call _locked() support routines, etc, if already holding the socket buffer lock. - Break out sbinsertoob() into sbinsertoob_locked() that asserts the socket buffer lock. sbinsertoob() unconditionally locks the socket buffer before calling sbinsertoob_locked(). - Break out sbflush() into sbflush_locked() that asserts the socket buffer lock. sbflush() unconditionally locks the socket buffer before calling sbflush_locked(). Update panic strings for new function names. - Break out sbdrop() into sbdrop_locked() that asserts the socket buffer lock. sbdrop() unconditionally locks the socket buffer before calling sbdrop_locked(). - Break out sbdroprecord() into sbdroprecord_locked() that asserts the socket buffer lock. sbdroprecord() unconditionally locks the socket buffer before calling sbdroprecord_locked(). - sofree() now calls socantsendmore_locked() and re-acquires the socket buffer lock on return. It also now calls sbrelease_locked(). - sorflush() now calls socantrcvmore_locked() and re-acquires the socket buffer lock on return. Clean up/mess up other behavior in sorflush() relating to the temporary stack copy of the socket buffer used with dom_dispose by more properly initializing the temporary copy, and selectively bzeroing/copying more carefully to prevent WITNESS from getting confused by improperly initialized mutexes. Annotate why that's necessary, or at least, needed. - soisconnected() now calls sbdrop_locked() before unlocking the socket buffer to avoid locking overhead. Some parts of this change were: Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS
# fa8368a8	20-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	When retrieving the SO_LINGER socket option for user space, hold the socket lock over pulling so_options and so_linger out of the socket structure in order to retrieve a consistent snapshot. This may be overkill if user space doesn't require a consistent snapshot.
# 6f4b1b55	20-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Convert an if->panic in soclose() into a call to KASSERT().
# ed2f7766	20-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Annotate some ordering-related issues in solisten() which are not yet resolved by socket locking: in particular, that we test the connection state at the socket layer without locking, request that the protocol begin listening, and then set the listen state on the socket non-atomically, resulting in a non-atomic cross-layer test-and-set.
# 31f555a1	18-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Assert socket buffer lock in sb_lock() to protect socket buffer sleep lock state. Convert tsleep() into msleep() with socket buffer mutex as argument. Hold socket buffer lock over sbunlock() to protect sleep lock state. Assert socket buffer lock in sbwait() to protect the socket buffer wait state. Convert tsleep() into msleep() with socket buffer mutex as argument. Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK() in order to call into these functions with the lock, as well as to start protecting other socket buffer use in their implementation. Drop the socket buffer mutexes around calls into the protocol layer, around potentially blocking operations, for copying to/from user space, and VM operations relating to zero-copy. Assert the socket buffer mutex strategically after code sections or at the beginning of loops. In some cases, modify return code to ensure locks are properly dropped. Convert the potentially blocking allocation of storage for the remote address in soreceive() into a non-blocking allocation; we may wish to move the allocation earlier so that it can block prior to acquisition of the socket buffer lock. Drop some spl use. NOTE: Some races exist in the current structuring of sosend() and soreceive(). This commit only merges basic socket locking in this code; follow-up commits will close additional races. As merged, these changes are not sufficient to run without Giant safely. Reviewed by: juli, tjr
# 7b574f2e	17-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Hold SOCK_LOCK(so) while frobbing so_options. Note that while the local race is corrected, there's still a global race in sosend() relating to so_options and the SO_DONTROUTE flag.
# c0122607	17-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Merge some additional leaf node socket buffer locking from rwatson_netperf: Introduce conditional locking of the socket buffer in fifofs kqueue filters; KNOTE() will be called holding the socket buffer locks in fifofs, but sometimes the kqueue() system call will poll using the same entry point without holding the socket buffer lock. Introduce conditional locking of the socket buffer in the socket kqueue filters; KNOTE() will be called holding the socket buffer locks in the socket code, but sometimes the kqueue() system call will poll using the same entry points without holding the socket buffer lock. Simplify the logic in sodisconnect() since we no longer need spls. NOTE: To remove conditional locking in the kqueue filters, it would make sense to use a separate kqueue API entry into the socket/fifo code when calling from the kqueue() system call.
# 9535efc0	17-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Merge additional socket buffer locking from rwatson_netperf: - Lock down low hanging fruit use of sb_flags with socket buffer lock. - Lock down low hanging fruit use of so_state with socket lock. - Lock down low hanging fruit use of so_options. - Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with socket buffer lock. - Annotate situations in which we unlock the socket lock and then grab the receive socket buffer lock, which are currently actually the same lock. Depending on how we want to play our cards, we may want to coallesce these lock uses to reduce overhead. - Convert a if()->panic() into a KASSERT relating to so_state in soaccept(). - Remove a number of splnet()/splx() references. More complex merging of socket and socket buffer locking to follow.
# c0b99ffa	14-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	The socket field so_state is used to hold a variety of socket related flags relating to several aspects of socket functionality. This change breaks out several bits relating to send and receive operation into a new per-socket buffer field, sb_state, in order to facilitate locking. This is required because, in order to provide more granular locking of sockets, different state fields have different locking properties. The following fields are moved to sb_state: SS_CANTRCVMORE (so_state) SS_CANTSENDMORE (so_state) SS_RCVATMARK (so_state) Rename respectively to: SBS_CANTRCVMORE (so_rcv.sb_state) SBS_CANTSENDMORE (so_snd.sb_state) SBS_RCVATMARK (so_rcv.sb_state) This facilitates locking by isolating fields to be located with other identically locked fields, and permits greater granularity in socket locking by avoiding storing fields with different locking semantics in the same short (avoiding locking conflicts). In the future, we may wish to coallesce sb_state and sb_flags; for the time being I leave them separate and there is no additional memory overhead due to the packing/alignment of shorts in the socket buffer structure.
# 395a08c9	12-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Extend coverage of SOCK_LOCK(so) to include so_count, the socket reference count: - Assert SOCK_LOCK(so) macros that directly manipulate so_count: soref(), sorele(). - Assert SOCK_LOCK(so) in macros/functions that rely on the state of so_count: sofree(), sotryfree(). - Acquire SOCK_LOCK(so) before calling these functions or macros in various contexts in the stack, both at the socket and protocol layers. - In some cases, perform soisdisconnected() before sotryfree(), as this could result in frobbing of a non-present socket if sotryfree() actually frees the socket. - Note that sofree()/sotryfree() will release the socket lock even if they don't free the socket. Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS
# f6c0cce6	12-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Introduce a mutex into struct sockbuf, sb_mtx, which will be used to protect fields in the socket buffer. Add accessor macros to use the mutex (SOCKBUF_()). Initialize the mutex in soalloc(), and destroy it in sodealloc(). Add addition, add SOCK_() access macros which will protect most remaining fields in the socket; for the time being, use the receive socket buffer mutex to implement socket level locking to reduce memory overhead. Submitted by: sam Sponosored by: FreeBSD Foundation Obtained from: BSD/OS
# 1a5ff928	08-Jun-2004	Stefan Farfeleder <stefanf@FreeBSD.org>	Avoid assignments to cast expressions. Reviewed by: md5 Approved by: das (mentor)
# 2658b3bb	01-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Integrate accept locking from rwatson_netperf, introducing a new global mutex, accept_mtx, which serializes access to the following fields across all sockets: so_qlen so_incqlen so_qstate so_comp so_incomp so_list so_head While providing only coarse granularity, this approach avoids lock order issues between sockets by avoiding ownership of the fields by a specific socket and its per-socket mutexes. While here, rewrite soclose(), sofree(), soaccept(), and sonewconn() to add assertions, close additional races and address lock order concerns. In particular: - Reorganize the optimistic concurrency behavior in accept1() to always allocate a file descriptor with falloc() so that if we do find a socket, we don't have to encounter the "Oh, there wasn't a socket" race that can occur if falloc() sleeps in the current code, which broke inbound accept() ordering, not to mention requiring backing out socket state changes in a way that raced with the protocol level. We may want to add a lockless read of the queue state if polling of empty queues proves to be important to optimize. - In accept1(), soref() the socket while holding the accept lock so that the socket cannot be free'd in a race with the protocol layer. Likewise in netgraph equivilents of the accept1() code. - In sonewconn(), loop waiting for the queue to be small enough to insert our new socket once we've committed to inserting it, or races can occur that cause the incomplete socket queue to overfill. In the previously implementation, it was sufficient to simply tested once since calling soabort() didn't release synchronization permitting another thread to insert a socket as we discard a previous one. - In soclose()/sofree()/et al, it is the responsibility of the caller to remove a socket from the incomplete connection queue before calling soabort(), which prevents soabort() from having to walk into the accept socket to release the socket from its queue, and avoids races when releasing the accept mutex to enter soabort(), permitting soabort() to avoid lock ordering issues with the caller. - Generally cluster accept queue related operations together throughout these functions in order to facilitate locking. Annotate new locking in socketvar.h.
# 36568179	31-May-2004	Robert Watson <rwatson@FreeBSD.org>	The SS_COMP and SS_INCOMP flags in the so_state field indicate whether the socket is on an accept queue of a listen socket. This change renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new state field on the socket, so_qstate, as the locking for these flags is substantially different for the locking on the remainder of the flags in so_state.
# 866046f5	31-May-2004	Don Lewis <truckman@FreeBSD.org>	Add MSG_NBIO flag option to soreceive() and sosend() that causes them to behave the same as if the SS_NBIO socket flag had been set for this call. The SS_NBIO flag for ordinary sockets is set by fcntl(fd, F_SETFL, O_NONBLOCK). Pass the MSG_NBIO flag to the soreceive() and sosend() calls in fifo_read() and fifo_write() instead of frobbing the SS_NBIO flag on the underlying socket for each I/O operation. The O_NONBLOCK flag is a property of the descriptor, and unlike ordinary sockets, fifos may be referenced by multiple descriptors.
# 099a0e58	31-May-2004	Bosko Milekic <bmilekic@FreeBSD.org>	Bring in mbuma to replace mballoc. mbuma is an Mbuf & Cluster allocator built on top of a number of extensions to the UMA framework, all included herein. Extensions to UMA worth noting: - Better layering between slab <-> zone caches; introduce Keg structure which splits off slab cache away from the zone structure and allows multiple zones to be stacked on top of a single Keg (single type of slab cache); perhaps we should look into defining a subset API on top of the Keg for special use by malloc(9), for example. - UMA_ZONE_REFCNT zones can now be added, and reference counters automagically allocated for them within the end of the associated slab structures. uma_find_refcnt() does a kextract to fetch the slab struct reference from the underlying page, and lookup the corresponding refcnt. mbuma things worth noting: - integrates mbuf & cluster allocations with extended UMA and provides caches for commonly-allocated items; defines several zones (two primary, one secondary) and two kegs. - change up certain code paths that always used to do: m_get() + m_clget() to instead just use m_getcl() and try to take advantage of the newly defined secondary Packet zone. - netstat(1) and systat(1) quickly hacked up to do basic stat reporting but additional stats work needs to be done once some other details within UMA have been taken care of and it becomes clearer to how stats will work within the modified framework. From the user perspective, one implication is that the NMBCLUSTERS compile-time option is no longer used. The maximum number of clusters is still capped off according to maxusers, but it can be made unlimited by setting the kern.ipc.nmbclusters boot-time tunable to zero. Work should be done to write an appropriate sysctl handler allowing dynamic tuning of kern.ipc.nmbclusters at runtime. Additional things worth noting/known issues (READ): - One report of 'ips' (ServeRAID) driver acting really slow in conjunction with mbuma. Need more data. Latest report is that ips is equally sucking with and without mbuma. - Giant leak in NFS code sometimes occurs, can't reproduce but currently analyzing; brueffer is able to reproduce but THIS IS NOT an mbuma-specific problem and currently occurs even WITHOUT mbuma. - Issues in network locking: there is at least one code path in the rip code where one or more locks are acquired and we end up in m_prepend() with M_WAITOK, which causes WITNESS to whine from within UMA. Current temporary solution: force all UMA allocations to be M_NOWAIT from within UMA for now to avoid deadlocks unless WITNESS is defined and we can determine with certainty that we're not holding any locks when we're M_WAITOK. - I've seen at least one weird socketbuffer empty-but- mbuf-still-attached panic. I don't believe this to be related to mbuma but please keep your eyes open, turn on debugging, and capture crash dumps. This change removes more code than it adds. A paper is available detailing the change and considering various performance issues, it was presented at BSDCan2004: http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf Please read the paper for Future Work and implementation details, as well as credits. Testing and Debugging: rwatson, brueffer, Ketrien I. Saihr-Kesenchedra, ... Reviewed by: Lots of people (for different parts)
# 123f024b	09-Apr-2004	Robert Watson <rwatson@FreeBSD.org>	Compare pointers with NULL rather than using pointers are booleans in if/for statements. Assign pointers to NULL rather than typecast 0. Compare pointers with NULL rather than 0.
# 7f8a436f	05-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core
# 8e44a7ec	30-Mar-2004	Robert Watson <rwatson@FreeBSD.org>	In sofree(), avoid nested declaration and initialization in declaration. Observe that initialization in declaration is frequently incompatible with locking, not just a bad idea due to style(9). Submitted by: bde
# 181e65db	29-Mar-2004	Robert Watson <rwatson@FreeBSD.org>	Use a common return path for filt_soread() and filt_sowrite() to simplify the impact of locking on these functions. Submitted by: sam Sponsored by: FreeBSD Foundation
# 71c90a29	29-Mar-2004	Robert Watson <rwatson@FreeBSD.org>	In sofree(), moving caching of 'head' from 'so->so_head' to later in the function once it has been determined to be non-NULL to simplify locking on an earlier return.
# 746e5bf0	29-Feb-2004	Robert Watson <rwatson@FreeBSD.org>	Rename dup_sockaddr() to sodupsockaddr() for consistency with other functions in kern_socket.c. Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT in from the caller context rather than "1" or "0". Correct mflags pass into mac_init_socket() from previous commit to not include M_ZERO. Submitted by: sam
# 740d9ba6	29-Feb-2004	Scott Long <scottl@FreeBSD.org>	Convert the other use of flags to mflags in soalloc().
# 2bc87dcf	29-Feb-2004	Robert Watson <rwatson@FreeBSD.org>	Modify soalloc() API so that it accepts a malloc flags argument rather than a "waitok" argument. Callers now passing M_WAITOK or M_NOWAIT rather than 0 or 1. This simplifies the soalloc() logic, and also makes the waiting behavior of soalloc() more clear in the calling context. Submitted by: sam
# f662a931	11-Feb-2004	Brian Feldman <green@FreeBSD.org>	Always socantsendmore() before deallocating a socket. This, in turn, calls selwakeup() if necessary (which it is, if you don't want freed memory hanging around on your td->td_selq). Props to: alfred
# be8a62e8	31-Jan-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Introduce the SO_BINTIME option which takes a high-resolution timestamp at packet arrival. For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL since it has higher resolution and lower overhead. Simultaneous use of the two options is possible and they will return consistent timestamps. This introduces an extra test and a function call for SO_TIMEVAL, but I have not been able to measure that.
# 0541040c	18-Jan-2004	Ruslan Ermilov <ru@FreeBSD.org>	Since "m" is not part of the "mp" chain, need to free() it. Reported by: Stanford Metacompilation research group
# 9e71dd0f	16-Nov-2003	Robert Watson <rwatson@FreeBSD.org>	Reduce gratuitous redundancy and length in function names: mac_setsockopt_label_set() -> mac_setsockopt_label() mac_getsockopt_label_get() -> mac_getsockopt_label() mac_getsockopt_peerlabel_get() -> mac_getsockopt_peerlabel() Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 12cbb9dc	15-Nov-2003	Robert Watson <rwatson@FreeBSD.org>	When implementing getsockopt() for SO_LABEL and SO_PEERLABEL, make sure to sooptcopyin() the (struct mac) so that the MAC Framework knows which label types are being requested. This fixes process queries of socket labels. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 512824f8	09-Nov-2003	Seigo Tanimura <tanimura@FreeBSD.org>	- Implement selwakeuppri() which allows raising the priority of a thread being waken up. The thread waken up can run at a priority as high as after tsleep(). - Replace selwakeup()s with selwakeuppri()s and pass appropriate priorities. - Add cv_broadcastpri() which raises the priority of the broadcast threads. Used by selwakeuppri() if collision occurs. Not objected in: -arch, -current
# 395bb186	27-Oct-2003	Sam Leffler <sam@FreeBSD.org>	speedup stream socket recv handling by tracking the tail of the mbuf chain instead of walking the list for each append Submitted by: ps/jayanth Obtained from: netbsd (jason thorpe)
# 184dcdc7	21-Oct-2003	Mike Silbersack <silby@FreeBSD.org>	Change all SYSCTLS which are readonly and have a related TUNABLE from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide more useful error messages.
# cc342686	04-Aug-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Make the second argument to sooptcopyout() constant in order to simplify the upcoming PIM patches. Submitted by: Pavlin Radoslavov <pavlin@icir.org>
# 4e19fe10	17-Jul-2003	Robert Drehmel <robert@FreeBSD.org>	To avoid a kernel panic provoked by a NULL pointer dereference, do not clear the `sb_sel' member of the sockbuf structure while invalidating the receive sockbuf in sorflush(), called from soshutdown(). The panic was reproduceable from user land by attaching a knote with EVFILT_READ filters to a socket, disabling further reads from it using shutdown(2), and then closing it. knote_remove() was called to remove all knotes from the socket file descriptor by detaching each using its associated filterops' detach call- back function, sordetach() in this case, which tried to remove itself from the invalidated sockbuf's klist (sb_sel.si_note). PR: kern/54331
# 330841c7	14-Jul-2003	Jeffrey Hsu <hsu@FreeBSD.org>	Rev 1.121 meant to pass the value 1 to soalloc() to indicate waitok. Reported by: arr
# 677b542e	10-Jun-2003	David E. O'Brien <obrien@FreeBSD.org>	Use __FBSDID().
# 104a9b7e	29-Apr-2003	Alexander Kabaev <kan@FreeBSD.org>	Deprecate machine/limits.h in favor of new sys/limits.h. Change all in-tree consumers to include <sys/limits.h> Discussed on: standards@ Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>
# 695d74f3	14-Apr-2003	Olivier Houchard <cognet@FreeBSD.org>	Use while (controlp != NULL) instead of do ... while (control != NULL) There are valid cases where *controlp will be NULL at this point. Discussed with: dwmalone
# 8994a245	02-Mar-2003	Dag-Erling Smørgrav <des@FreeBSD.org>	Clean up whitespace, s/register //, refrain from strong urge to ANSIfy.
# c9524588	02-Mar-2003	Dag-Erling Smørgrav <des@FreeBSD.org>	uiomove-related caddr_t -> void * (just the low-hanging fruit)
# d6bf2378	19-Feb-2003	Olivier Houchard <cognet@FreeBSD.org>	Remove duplicate includes. Submitted by: Cyril Nguyen-Huu <cyril@ci0.org>
# a163d034	18-Feb-2003	Warner Losh <imp@FreeBSD.org>	Back out M_* changes, per decision of the TRB. Approved by: trb
# 44956c98	21-Jan-2003	Alfred Perlstein <alfred@FreeBSD.org>	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
# 6f7cab93	17-Jan-2003	Thomas Moestl <tmm@FreeBSD.org>	Disallow listen() on sockets which are in the SS_ISCONNECTED or SS_ISCONNECTING state, returning EINVAL (which is what POSIX mandates in this case). listen() on connected or connecting sockets would cause them to enter a bad state; in the TCP case, this could cause sockets to go catatonic or panics, depending on how the socket was connected. Reviewed by: -net MFC after: 2 weeks
# 48e3128b	12-Jan-2003	Matthew Dillon <dillon@FreeBSD.org>	Bow to the whining masses and change a union back into void *. Retain removal of unnecessary casts and throw in some minor cleanups to see if anyone complains, just for the hell of it.
# cd72f218	11-Jan-2003	Matthew Dillon <dillon@FreeBSD.org>	Change struct file f_data to un_data, a union of the correct struct pointer types, and remove a huge number of casts from code using it. Change struct xfile xf_data to xun_data (ABI is still compatible). If we need to add a #define for f_data and xf_data we can, but I don't think it will be necessary. There are no operational changes in this commit.
# a09de2f7	05-Jan-2003	Alfred Perlstein <alfred@FreeBSD.org>	In sodealloc(), if there is an accept filter present on the socket then call do_setopt_accept_filter(so, NULL) which will free the filter instead of duplicating the code in do_setopt_accept_filter(). Pointed out by: Hiten Pandya <hiten@angelica.unixdaemons.com>
# 6ce9c72c	23-Dec-2002	Poul-Henning Kamp <phk@FreeBSD.org>	s/sokqfilter/soo_kqfilter/ for consistency with the naming of all other socket/file operations.
# 8819f45b	27-Nov-2002	Maxim Konovalov <maxim@FreeBSD.org>	Small SO_RCVTIMEO and SO_SNDTIMEO values are mistakenly taken to be zero. PR: kern/32827 Submitted by: Hartmut Brandt <brandt@fokus.gmd.de> Approved by: re (jhb) MFC after: 2 weeks
# 29f19445	08-Nov-2002	Alfred Perlstein <alfred@FreeBSD.org>	Fix instances of macros with improperly parenthasized arguments. Verified by: md5
# 247a32f2	05-Nov-2002	Kelly Yancey <kbyanc@FreeBSD.org>	Fix filt_soread() to properly flag a kevent when a 0-byte datagram is received. Verified by: dougb, Manfred Antar <null@pozo.com> Sponsored by: NTT Multimedia Communications Labs
# 5ee0a409	01-Nov-2002	Alan Cox <alc@FreeBSD.org>	Revert the change in revision 1.77 of kern/uipc_socket2.c. It is causing a panic because the socket's state isn't as expected by sofree(). Discussed with: dillon, fenner
# e0f640e8	01-Nov-2002	Kelly Yancey <kbyanc@FreeBSD.org>	Track the number of non-data chararacters stored in socket buffers so that the data value returned by kevent()'s EVFILT_READ filter on non-TCP sockets accurately reflects the amount of data that can be read from the sockets by applications. PR: 30634 Reviewed by: -net, -arch Sponsored by: NTT Multimedia Communications Labs MFC after: 2 weeks
# 6151efaa	28-Oct-2002	Robert Watson <rwatson@FreeBSD.org>	Trim extraneous #else and #endif MAC comments per style(9).
# 83985c26	05-Oct-2002	Robert Watson <rwatson@FreeBSD.org>	Modify label allocation semantics for sockets: pass in soalloc's malloc flags so that we can call malloc with M_NOWAIT if necessary, avoiding potential sleeps while holding mutexes in the TCP syncache code. Similar to the existing support for mbuf label allocation: if we can't allocate all the necessary label store in each policy, we back out the label allocation and fail the socket creation. Sync from MAC tree. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# ea6027a8	15-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Make similar changes to fo_stat() and fo_poll() as made earlier to fo_read() and fo_write(): explicitly use the cred argument to fo_poll() as "active_cred" using the passed file descriptor's f_cred reference to provide access to the file credential. Add an active_cred argument to fo_stat() so that implementers have access to the active credential as well as the file credential. Generally modify callers of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which was redundantly provided via the fp argument. This set of modifications also permits threads to perform these operations on behalf of another thread without modifying their credential. Trickle this change down into fo_stat/poll() implementations: - badfo_poll(), badfo_stat(): modify/add arguments. - kqueue_poll(), kqueue_stat(): modify arguments. - pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to MAC checks rather than td->td_ucred. - soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather than cred to pru_sopoll() to maintain current semantics. - sopoll(): moidfy arguments. - vn_poll(), vn_statfile(): modify/add arguments, pass new arguments to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL() to maintian current semantics. - vn_close(): rename cred to file_cred to reflect reality while I'm here. - vn_stat(): Add active_cred and file_cred arguments to vn_stat() and consumers so that this distinction is maintained at the VFS as well as 'struct file' layer. Pass active_cred instead of td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics. - fifofs: modify the creation of a "filetemp" so that the file credential is properly initialized and can be used in the socket code if desired. Pass ap->a_td->td_ucred as the active credential to soo_poll(). If we teach the vnop interface about the distinction between file and active credentials, we would use the active credential here. Note that current inconsistent passing of active_cred vs. file_cred to VOP's is maintained. It's not clear why GETATTR would be authorized using active_cred while POLL would be authorized using file_cred at the file system level. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 5c5384fe	12-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Use the credential authorizing the socket creation operation to perform the jail check and the MAC socket labeling in socreate(). This handles socket creation using a cached credential better (such as in the NFS client code when rebuilding a socket following a disconnect: the new socket should be created using the nfsmount cached cred, not the cred of the thread causing the socket to be rebuilt). Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# f9d0d524	01-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Include file cleanup; mac.h and malloc.h at one point had ordering relationship requirements, and no longer do. Reminded by: bde
# b8279195	31-Jul-2002	Robert Watson <rwatson@FreeBSD.org>	Introduce support for Mandatory Access Control and extensible kernel access control. Implement two IOCTLs at the socket level to retrieve the primary and peer labels from a socket. Note that this user process interface will be changing to improve multi-policy support. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 335654d7	30-Jul-2002	Robert Watson <rwatson@FreeBSD.org>	Introduce support for Mandatory Access Control and extensible kernel access control. Invoke the necessary MAC entry points to maintain labels on sockets. In particular, invoke entry points during socket allocation and destruction, as well as creation by a process or during an accept-scenario (sonewconn). For UNIX domain sockets, also assign a peer label. As the socket code isn't locked down yet, locking interactions are not yet clear. Various protocol stack socket operations (such as peer label assignment for IPv4) will follow. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 5f0de712	24-Jul-2002	Mike Barcroft <mike@FreeBSD.org>	Catch up to rev 1.87 of sys/sys/socketvar.h (sb_cc changed from u_long to u_int). Noticed by: sparc64 tinderbox
# 80208239	28-Jun-2002	Alfred Perlstein <alfred@FreeBSD.org>	More caddr_t removal. Change struct knote's kn_hook from caddr_t to void *.
# 98cb733c	25-Jun-2002	Kenneth D. Merry <ken@FreeBSD.org>	At long last, commit the zero copy sockets code. MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes. ti.4: Update the ti(4) man page to include information on the TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options, and also include information about the new character device interface and the associated ioctls. man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated links. jumbo.9: New man page describing the jumbo buffer allocator interface and operation. zero_copy.9: New man page describing the general characteristics of the zero copy send and receive code, and what an application author should do to take advantage of the zero copy functionality. NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS, TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT. conf/files: Add uipc_jumbo.c and uipc_cow.c. conf/options: Add the 5 options mentioned above. kern_subr.c: Receive side zero copy implementation. This takes "disposable" pages attached to an mbuf, gives them to a user process, and then recycles the user's page. This is only active when ZERO_COPY_SOCKETS is turned on and the kern.ipc.zero_copy.receive sysctl variable is set to 1. uipc_cow.c: Send side zero copy functions. Takes a page written by the user and maps it copy on write and assigns it kernel virtual address space. Removes copy on write mapping once the buffer has been freed by the network stack. uipc_jumbo.c: Jumbo disposable page allocator code. This allocates (optionally) disposable pages for network drivers that want to give the user the option of doing zero copy receive. uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are enabled if ZERO_COPY_SOCKETS is turned on. Add zero copy send support to sosend() -- pages get mapped into the kernel instead of getting copied if they meet size and alignment restrictions. uipc_syscalls.c:Un-staticize some of the sf* functions so that they can be used elsewhere. (uipc_cow.c) if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid calling malloc() with M_WAITOK. Return an error if the M_NOWAIT malloc fails. The ti(4) driver and the wi(4) driver, at least, call this with a mutex held. This causes witness warnings for 'ifconfig -a' with a wi(4) or ti(4) board in the system. (I've only verified for ti(4)). ip_output.c: Fragment large datagrams so that each segment contains a multiple of PAGE_SIZE amount of data plus headers. This allows the receiver to potentially do page flipping on receives. if_ti.c: Add zero copy receive support to the ti(4) driver. If TI_PRIVATE_JUMBOS is not defined, it now uses the jumbo(9) buffer allocator for jumbo receive buffers. Add a new character device interface for the ti(4) driver for the new debugging interface. This allows (a patched version of) gdb to talk to the Tigon board and debug the firmware. There are also a few additional debugging ioctls available through this interface. Add header splitting support to the ti(4) driver. Tweak some of the default interrupt coalescing parameters to more useful defaults. Add hooks for supporting transmit flow control, but leave it turned off with a comment describing why it is turned off. if_tireg.h: Change the firmware rev to 12.4.11, since we're really at 12.4.11 plus fixes from 12.4.13. Add defines needed for debugging. Remove the ti_stats structure, it is now defined in sys/tiio.h. ti_fw.h: 12.4.11 firmware. ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13, and my header splitting patches. Revision 12.4.13 doesn't handle 10/100 negotiation properly. (This firmware is the same as what was in the tree previously, with the addition of header splitting support.) sys/jumbo.h: Jumbo buffer allocator interface. sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to indicate that the payload buffer can be thrown away / flipped to a userland process. socketvar.h: Add prototype for socow_setup. tiio.h: ioctl interface to the character portion of the ti(4) driver, plus associated structure/type definitions. uio.h: Change prototype for uiomoveco() so that we'll know whether the source page is disposable. ufs_readwrite.c:Update for new prototype of uiomoveco(). vm_fault.c: In vm_fault(), check to see whether we need to do a page based copy on write fault. vm_object.c: Add a new function, vm_object_allocate_wait(). This does the same thing that vm_object allocate does, except that it gives the caller the opportunity to specify whether it should wait on the uma_zalloc() of the object structre. This allows vm objects to be allocated while holding a mutex. (Without generating WITNESS warnings.) vm_object_allocate() is implemented as a call to vm_object_allocate_wait() with the malloc flag set to M_WAITOK. vm_object.h: Add prototype for vm_object_allocate_wait(). vm_page.c: Add page-based copy on write setup, clear and fault routines. vm_page.h: Add page based COW function prototypes and variable in the vm_page structure. Many thanks to Drew Gallatin, who wrote the zero copy send and receive code, and to all the other folks who have tested and reviewed this code over the years.
# c33c8251	20-Jun-2002	Alfred Perlstein <alfred@FreeBSD.org>	Implement SO_NOSIGPIPE option for sockets. This allows one to request that an EPIPE error return not generate SIGPIPE on sockets. Submitted by: lioux Inspired by: Darwin
# 4cc20ab1	31-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Back out my lats commit of locking down a socket, it conflicts with hsu's work. Requested by: hsu
# ec418160	21-May-2002	Andrew R. Reiter <arr@FreeBSD.org>	- td will never be NULL, so the call to soalloc() in socreate() will always be passed a 1; we can, however, use M_NOWAIT to indicate this. - Check so against NULL since it's a pointer to a structure.
# 1515cd22	21-May-2002	Andrew R. Reiter <arr@FreeBSD.org>	- OR the flag variable with M_ZERO so that the uma_zalloc() handles the zero'ing out of the allocated memory. Also removed the logical bzero that followed.
# 243917fe	19-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Lock down a socket, milestone 1. o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket. o Determine the lock strategy for each members in struct socket. o Lock down the following members: - so_count - so_options - so_linger - so_state o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket: - sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup() Reviewed by: alfred
# e649887b	06-May-2002	Alfred Perlstein <alfred@FreeBSD.org>	Make funsetown() take a 'struct sigio **' so that the locking can be done internally. Ensure that no one can fsetown() to a dying process/pgrp. We need to check the process for P_WEXIT to see if it's exiting. Process groups are already safe because there is no such thing as a pgrp zombie, therefore the proctree lock completely protects the pgrp from having sigio structures associated with it after it runs funsetownlst. Add sigio lock to witness list under proctree and allproc, but over proc and pgrp. Seigo Tanimura helped with this.
# f1320723	01-May-2002	Alfred Perlstein <alfred@FreeBSD.org>	Redo the sigio locking. Turn the sigio sx into a mutex. Sigio lock is really only needed to protect interrupts from dereferencing the sigio pointer in an object when the sigio itself is being destroyed. In order to do this in the most unintrusive manner change pgsigio's sigio * argument into a **, that way we can lock internally to the function.
# e1f1827f	25-Apr-2002	Mike Silbersack <silby@FreeBSD.org>	Make sure that sockets undergoing accept filtering are aborted in a LRU fashion when the listen queue fills up. Previously, there was no mechanism to kick out old sockets, leading to an easy DoS of daemons using accept filtering. Reviewed by: alfred MFC after: 3 days
# 20504246	07-Apr-2002	Jeffrey Hsu <hsu@FreeBSD.org>	There's only one socket zone so we don't need to remember it in every socket structure.
# 59295dba	20-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	UMA permited us to utilize the 'waitok' flag to soalloc.
# c897b813	19-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	Remove references to vm_zone.h and switch over to the new uma API. Also, remove maxsockets. If you look carefully you'll notice that the old zone allocator never honored this anyway.
# 8355f576	19-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	This is the first part of the new kernel memory allocator. This replaces malloc(9) and vm_zone with a slab like allocator. Reviewed by: arch@
# 167b8d03	28-Feb-2002	Ian Dowse <iedowse@FreeBSD.org>	In sosend(), enforce the socket buffer limits regardless of whether the data was supplied as a uio or an mbuf. Previously the limit was ignored for mbuf data, and NFS could run the kernel out of mbufs when an ipfw rule blocked retransmissions.
# a854ed98	27-Feb-2002	John Baldwin <jhb@FreeBSD.org>	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.
# ecde8f7c	04-Feb-2002	Matthew Dillon <dillon@FreeBSD.org>	Get rid of the twisted MFREE() macro entirely. Reviewed by: dg, bmilekic MFC after: 3 days
# 468485b8	14-Jan-2002	Alfred Perlstein <alfred@FreeBSD.org>	Fix select on fifos. Backout revision 1.56 and 1.57 of fifo_vnops.c. Introduce a new poll op "POLLINIGNEOF" that can be used to ignore EOF on a fifo, POLLIN/POLLRDNORM is converted to POLLINIGNEOF within the FIFO implementation to effect the correct behavior. This should allow one to view a fifo pretty much as a data source rather than worry about connections coming and going. Reviewed by: bde
# 9c4d63da	31-Dec-2001	Robert Watson <rwatson@FreeBSD.org>	o Make the credential used by socreate() an explicit argument to socreate(), rather than getting it implicitly from the thread argument. o Make NFS cache the credential provided at mount-time, and use the cached credential (nfsmount->nm_cred) when making calls to socreate() on initially connecting, or reconnecting the socket. This fixes bugs involving NFS over TCP and ipfw uid/gid rules, as well as bugs involving NFS and mandatory access control implementations. Reviewed by: freebsd-arch
# b1e4abd2	16-Nov-2001	Matthew Dillon <dillon@FreeBSD.org>	Give struct socket structures a ref counting interface similar to vnodes. This will hopefully serve as a base from which we can expand the MP code. We currently do not attempt to obtain any mutex or SX locks, but the door is open to add them when we nail down exactly how that part of it is going to work.
# 7377f0d1	12-Nov-2001	Giorgos Keramidas <keramida@FreeBSD.org>	Remove EOL whitespace. Reviewed by: alfred
# 074df018	12-Nov-2001	Giorgos Keramidas <keramida@FreeBSD.org>	Make KASSERT's print the values that triggered a panic. Reviewed by: alfred
# bd78cece	11-Oct-2001	John Baldwin <jhb@FreeBSD.org>	Change the kernel's ucred API as follows: - crhold() returns a reference to the ucred whose refcount it bumps. - crcopy() now simply copies the credentials from one credential to another and has no return value. - a new crshared() primitive is added which returns true if a ucred's refcount is > 1 and false (0) otherwise.
# 8a7d8cc6	09-Oct-2001	Robert Watson <rwatson@FreeBSD.org>	- Combine kern.ps_showallprocs and kern.ipc.showallsockets into a single kern.security.seeotheruids_permitted, describes as: "Unprivileged processes may see subjects/objects with different real uid" NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is an API change. kern.ipc.showallsockets does not. - Check kern.security.seeotheruids_permitted in cr_cansee(). - Replace visibility calls to socheckuid() with cr_cansee() (retain the change to socheckuid() in ipfw, where it is used for rule-matching). - Remove prison_unpcb() and make use of cr_cansee() against the UNIX domain socket credential instead of comparing root vnodes for the UDS and the process. This allows multiple jails to share the same chroot() and not see each others UNIX domain sockets. - Remove unused socheckproc(). Now that cr_cansee() is used universally for socket visibility, a variety of policies are more consistently enforced, including uid-based restrictions and jail-based restrictions. This also better-supports the introduction of additional MAC models. Reviewed by: ps, billf Obtained from: TrustedBSD Project
# 4787fd37	05-Oct-2001	Paul Saab <ps@FreeBSD.org>	Only allow users to see their own socket connections if kern.ipc.showallsockets is set to 0. Submitted by: billf (with modifications by me) Inspired by: Dave McKay (aka pm aka Packet Magnet) Reviewed by: peter MFC after: 2 weeks
# 2bc21ed9	04-Oct-2001	David Malone <dwmalone@FreeBSD.org>	Hopefully improve control message passing over Unix domain sockets. 1) Allow the sending of more than one control message at a time over a unix domain socket. This should cover the PR 29499. 2) This requires that unp_{ex,in}ternalize and unp_scan understand mbufs with more than one control message at a time. 3) Internalize and externalize used to work on the mbuf in-place. This made life quite complicated and the code for sizeof(int) < sizeof(file ) could end up doing the wrong thing. The patch always create a new mbuf/cluster now. This resulted in the change of the prototype for the domain externalise function. 4) You can now send SCM_TIMESTAMP messages. 5) Always use CMSG_DATA(cm) to determine the start where the data in unp_{ex,in}ternalize. It was using ((struct cmsghdr )cm + 1) in some places, which gives the wrong alignment on the alpha. (NetBSD made this fix some time ago). This results in an ABI change for discriptor passing and creds passing on the alpha. (Probably on the IA64 and Spare ports too). 6) Fix userland programs to use CMSG_* macros too. 7) Be more careful about freeing mbufs containing (file *)s. This is made possible by the prototype change of externalise. PR: 29499 MFC after: 6 weeks
# ed01445d	21-Sep-2001	John Baldwin <jhb@FreeBSD.org>	Use the passed in thread to selrecord() instead of curthread.
# b40ce416	12-Sep-2001	Julian Elischer <julian@FreeBSD.org>	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# fb919e4d	01-May-2001	Mark Murray <markm@FreeBSD.org>	Undo part of the tangle of having sys/lock.h and sys/mutex.h included in other "system" header files. Also help the deprecation of lockmgr.h by making it a sub-include of sys/lock.h and removing sys/lockmgr.h form kernel .c files. Sort sys/*.h includes where possible in affected files. OK'ed by: bde (with reservations)
# 3abedb4e	27-Apr-2001	Alfred Perlstein <alfred@FreeBSD.org>	Actually show the values that tripped the assertion "receive 1"
# 4d286823	16-Mar-2001	Jonathan Lemon <jlemon@FreeBSD.org>	When doing a recv(.. MSG_WAITALL) for a message which is larger than the socket buffer size, the receive is done in sections. After completing a read, call pru_rcvd on the underlying protocol before blocking again. This allows the the protocol to take appropriate action, such as sending a TCP window update to the peer, if the window happened to close because the socket buffer was filled. If the protocol is not notified, a TCP transfer may stall until the remote end sends a window probe.
# c0647e0d	09-Mar-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Push the test for a disconnected socket when accept()ing down to the protocol layer. Not all protocols behave identically. This fixes the brokenness observed with unix-domain sockets (and postfix)
# 8ac6dca7	27-Feb-2001	Ruslan Ermilov <ru@FreeBSD.org>	In soshutdown(), use SHUT_{RD,WR,RDWR} instead of FREAD and FWRITE. Also, return EINVAL if `how' is invalid, as required by POSIX spec.
# da403b9d	23-Feb-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Introduce a NOTE_LOWAT flag for use with the read/write filters, which allow the watermark to be passed in via the data field during the EV_ADD operation. Hook this up to the socket read/write filters; if specified, it overrides the so_{rcv\|snd}.sb_lowat values in the filter. Inspired by: "Ronald F. Guilmette" <rfg@monkeys.com>
# b07540c8	23-Feb-2001	Jonathan Lemon <jlemon@FreeBSD.org>	When returning EV_EOF for the socket read/write filters, also return the current socket error in fflags. This may be useful for determining why a connect() request fails. Inspired by: "Jonathan Graehl" <jonathan@graehl.org>
# 91421ba2	20-Feb-2001	Robert Watson <rwatson@FreeBSD.org>	o Move per-process jail pointer (p->pr_prison) to inside of the subject credential structure, ucred (cr->cr_prison). o Allow jail inheritence to be a function of credential inheritence. o Abstract prison structure reference counting behind pr_hold() and pr_free(), invoked by the similarly named credential reference management functions, removing this code from per-ABI fork/exit code. o Modify various jail() functions to use struct ucred arguments instead of struct proc arguments. o Introduce jailed() function to determine if a credential is jailed, rather than directly checking pointers all over the place. o Convert PRISON_CHECK() macro to prison_check() function. o Move jail() function prototypes to jail.h. o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the flag in the process flags field itself. o Eliminate that "const" qualifier from suser/p_can/etc to reflect mutex use. Notes: o Some further cleanup of the linux/jail code is still required. o It's now possible to consider resolving some of the process vs credential based permission checking confusion in the socket code. o Mutex protection of struct prison is still not present, and is required to protect the reference count plus some fields in the structure. Reviewed by: freebsd-arch Obtained from: TrustedBSD Project
# 608a3ce6	15-Feb-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Extend kqueue down to the device layer. Backwards compatible approach suggested by: peter
# 2fd7d53d	13-Feb-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Return ECONNABORTED from accept if connection is closed while on the listen queue, as well as the current behavior of a zero-length sockaddr. Obtained from: KAME Reviewed by: -net
# a3ea6d41	21-Jan-2001	Dag-Erling Smørgrav <des@FreeBSD.org>	First step towards an MP-safe zone allocator: - have zalloc() and zfree() always lock the vm_zone. - remove zalloci() and zfreei(), which are now redundant. Reviewed by: bmilekic, jasone
# 2a0c503e	21-Dec-2000	Bosko Milekic <bmilekic@FreeBSD.org>	* Rename M_WAIT mbuf subsystem flag to M_TRYWAIT. This is because calls with M_WAIT (now M_TRYWAIT) may not wait forever when nothing is available for allocation, and may end up returning NULL. Hopefully we now communicate more of the right thing to developers and make it very clear that it's necessary to check whether calls with M_(TRY)WAIT also resulted in a failed allocation. M_TRYWAIT basically means "try harder, block if necessary, but don't necessarily wait forever." The time spent blocking is tunable with the kern.ipc.mbuf_wait sysctl. M_WAIT is now deprecated but still defined for the next little while. * Fix a typo in a comment in mbuf.h * Fix some code that was actually passing the mbuf subsystem's M_WAIT to malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the value of the M_WAIT flag, this could have became a big problem.
# 7cc0979f	08-Dec-2000	David Malone <dwmalone@FreeBSD.org>	Convert more malloc+bzero to malloc+M_ZERO. Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>
# 830fedd2	19-Nov-2000	Alfred Perlstein <alfred@FreeBSD.org>	Accept filters broke kernels compiled without options INET. Make accept filters conditional on INET support to fix. Pointed out by: bde Tested and assisted by: Stephen J. Kiernan <sab@vegamuse.org>
# d5aa1234	27-Sep-2000	Jonathan Lemon <jlemon@FreeBSD.org>	Check so_error in filt_so{read\|write} in order to detect UDP errors. PR: 21601
# f535380c	05-Sep-2000	Don Lewis <truckman@FreeBSD.org>	Remove uidinfo hash table lookup and maintenance out of chgproccnt() and chgsbsize(), which are called rather frequently and may be called from an interrupt context in the case of chgsbsize(). Instead, do the hash table lookup and maintenance when credentials are changed, which is a lot less frequent. Add pointers to the uidinfo structures to the ucred and pcred structures for fast access. Pass a pointer to the credential to chgproccnt() and chgsbsize() instead of passing the uid. Add a reference count to the uidinfo structure and use it to decide when to free the structure rather than freeing the structure when the resource consumption drops to zero. Move the resource tracking code from kern_proc.c to kern_resource.c. Move some duplicate code sequences in kern_prot.c to separate helper functions. Change KASSERTs in this code to unconditional tests and calls to panic().
# 6aef685f	29-Aug-2000	Brian Feldman <green@FreeBSD.org>	Remove any possibility of hiwat-related race conditions by changing the chgsbsize() call to use a "subject" pointer (&sb.sb_hiwat) and a u_long target to set it to. The whole thing is splnet(). This fixes a problem that jdp has been able to provoke.
# a1144591	07-Aug-2000	Jonathan Lemon <jlemon@FreeBSD.org>	Make the kqueue socket read filter honor the SO_RCVLOWAT value. Spotted by: "Steve M." <stevem@redlinenetworks.com>
# f4088964	19-Jul-2000	Alfred Perlstein <alfred@FreeBSD.org>	only allow accept filter modifications on listening sockets Submitted by: ps
# c6362551	22-Jun-2000	Alfred Perlstein <alfred@FreeBSD.org>	fix races in the uidinfo subsystem, several problems existed: 1) while allocating a uidinfo struct malloc is called with M_WAITOK, it's possible that while asleep another process by the same user could have woken up earlier and inserted an entry into the uid hash table. Having redundant entries causes inconsistancies that we can't handle. fix: do a non-waiting malloc, and if that fails then do a blocking malloc, after waking up check that no one else has inserted an entry for us already. 2) Because many checks for sbsize were done as "test then set" in a non atomic manner it was possible to exceed the limits put up via races. fix: instead of querying the count then setting, we just attempt to set the count and leave it up to the function to return success or failure. 3) The uidinfo code was inlining and repeating, lookups and insertions and deletions needed to be in their own functions for clarity. Reviewed by: green
# a79b7128	19-Jun-2000	Alfred Perlstein <alfred@FreeBSD.org>	return of the accept filter part II accept filters are now loadable as well as able to be compiled into the kernel. two accept filters are provided, one that returns sockets when data arrives the other when an http request is completed (doesn't work with 0.9 requests) Reviewed by: jmg
# a72fda71	18-Jun-2000	Alfred Perlstein <alfred@FreeBSD.org>	backout accept optimizations. Requested by: jmg, dcs, jdp, nate
# 8f4e4aa5	15-Jun-2000	Alfred Perlstein <alfred@FreeBSD.org>	add socketoptions DELAYACCEPT and HTTPACCEPT which will not allow an accept() until the incoming connection has either data waiting or what looks like a HTTP request header already in the socketbuffer. This ought to reduce the context switch time and overhead for processing requests. The initial idea and code for HTTPACCEPT came from Yahoo engineers and has been cleaned up and a more lightweight DELAYACCEPT for non-http servers has been added Reviewed by: silence on hackers.
# 3b43fd62	13-Jun-2000	Jeroen Ruigrok van der Werven <asmodai@FreeBSD.org>	Fix panic by moving the prp == 0 check up the order of sanity checks. Submitted by: Bart Thate <freebsd@1st.dudi.org> on -current Approved by: rwatson
# 7cadc266	03-Jun-2000	Robert Watson <rwatson@FreeBSD.org>	o Modify jail to limit creation of sockets to UNIX domain sockets, TCP/IP (v4) sockets, and routing sockets. Previously, interaction with IPv6 was not well-defined, and might be inappropriate for some environments. Similarly, sysctl MIB entries providing interface information also give out only addresses from those protocol domains. For the time being, this functionality is enabled by default, and toggleable using the sysctl variable jail.socket_unixiproute_only. In the future, protocol domains will be able to determine whether or not they are ``jail aware''. o Further limitations on process use of getpriority() and setpriority() by jailed processes. Addresses problem described in kern/17878. Reviewed by: phk, jmg
# e3975643	25-May-2000	Jake Burkholder <jake@FreeBSD.org>	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others
# 740a1973	23-May-2000	Jake Burkholder <jake@FreeBSD.org>	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd
# cb679c38	16-Apr-2000	Jonathan Lemon <jlemon@FreeBSD.org>	Introduce kqueue() and kevent(), a kernel event notification facility.
# 95b2b777	18-Mar-2000	Bill Fenner <fenner@FreeBSD.org>	Make sure to free the socket in soabort() if the protocol couldn't free it (this could happen if the protocol already freed its part and we just kept the socket around to make sure accept(2) didn't block)
# bfbbc4aa	13-Jan-2000	Jason Evans <jasone@FreeBSD.org>	Add aio_waitcomplete(). Make aio work correctly for socket descriptors. Make gratuitous style(9) fixes (me, not the submitter) to make the aio code more readable. PR: kern/12053 Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>
# c2696359	26-Dec-1999	Brian Feldman <green@FreeBSD.org>	Correct an uninitialized variable use, which, unlike most times, is actually a bug this time. Submitted by: bde Reviewed by: bde
# f48b807f	11-Dec-1999	Brian Feldman <green@FreeBSD.org>	This is Bosko Milekic's mbuf allocation waiting code. Basically, this means that running out of mbuf space isn't a panic anymore, and code which runs out of network memory will sleep to wait for it. Submitted by: Bosko Milekic <bmilekic@dsuper.net> Reviewed by: green, wollman
# 82cd038d	21-Nov-1999	Yoshinobu Inoue <shin@FreeBSD.org>	KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP for IPv6 yet) With this patch, you can assigne IPv6 addr automatically, and can reply to IPv6 ping. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 2e3c8fcb	16-Nov-1999	Poul-Henning Kamp <phk@FreeBSD.org>	This is a partial commit of the patch from PR 14914: Alot of the code in sys/kern directly accesses the Q_HEAD and Q_ENTRY structures for list operations. This patch makes all list operations in sys/kern use the queue(3) macros, rather than directly accessing the *Q_{HEAD,ENTRY} structures. This batch of changes compile to the same object files. Reviewed by: phk Submitted by: Jake Burkholder <jake@checker.org> PR: 14914
# ecf72308	09-Oct-1999	Brian Feldman <green@FreeBSD.org>	Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total usage limit.
# 2f9a2132	18-Sep-1999	Brian Feldman <green@FreeBSD.org>	Change so_cred's type to a ucred, not a pcred. THis makes more sense, actually. Make a sonewconn3() which takes an extra argument (proc) so new sockets created with sonewconn() from a user's system call get the correct credentials, not just the parent's credentials.
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# f29be021	17-Jun-1999	Brian Feldman <green@FreeBSD.org>	Reviewed by: the cast of thousands This is the change to struct sockets that gets rid of so_uid and replaces it with a much more useful struct pcred *so_cred. This is here to be able to do socket-level credential checks (i.e. IPFW uid/gid support, to be added to HEAD soon). Along with this comes an update to pidentd which greatly simplifies the code necessary to get a uid from a socket. Soon to come: a sysctl() interface to finding individual sockets' credentials.
# 9c9906e9	03-Jun-1999	Peter Wemm <peter@FreeBSD.org>	Plug a mbuf leak in tcp_usr_send(). pru_send() routines are expected to either enqueue or free their mbuf chains, but tcp_usr_send() was dropping them on the floor if the tcpcb/inpcb has been torn down in the middle of a send/write attempt. This has been responsible for a wide variety of mbuf leak patterns, ranging from slow gradual leakage to rather rapid exhaustion. This has been a problem since before 2.2 was branched and appears to have been fixed in rev 1.16 and lost in 1.23/1.28. Thanks to Jayanth Vijayaraghavan <jayanth@yahoo-inc.com> for checking (extensively) into this on a live production 2.2.x system and that it was the actual cause of the leak and looks like it fixes it. The machine in question was loosing (from memory) about 150 mbufs per hour under load and a change similar to this stopped it. (Don't blame Jayanth for this patch though) An alternative approach to this would be to recheck SS_CANTSENDMORE etc inside the splnet() right before calling pru_send() after all the potential sleeps, interrupts and delays have happened. However, this would mean exposing knowledge of the tcp stack's reset handling and removal of the pcb to the generic code. There are other things that call pru_send() directly though. Problem originally noted by: John Plevyak <jplevyak@inktomi.com>
# 925fa5c3	21-May-1999	Andrey A. Chernov <ache@FreeBSD.org>	Realy fix overflow on SO_*TIMEO Submitted by: bde
# 3d177f46	03-May-1999	Bill Fumerola <billf@FreeBSD.org>	Add sysctl descriptions to many SYSCTL_XXXs PR: kern/11197 Submitted by: Adrian Chadd <adrian@FreeBSD.org> Reviewed by: billf(spelling/style/minor nits) Looked at by: bde(style)
# 02a3d526	24-Apr-1999	Andrey A. Chernov <ache@FreeBSD.org>	Lite2 bugfixes merge: so_linger is in seconds, not in 1/HZ range checking in SO_*TIMEO was wrong PR: 11252
# ce02431f	16-Feb-1999	Doug Rabson <dfr@FreeBSD.org>	* Change sysctl from using linker_set to construct its tree using SLISTs. This makes it possible to change the sysctl tree at runtime. * Change KLD to find and register any sysctl nodes contained in the loaded file and to unregister them when the file is unloaded. Reviewed by: Archie Cobbs <archie@whistle.com>, Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)
# 8f70ac3e	02-Feb-1999	Bill Fenner <fenner@FreeBSD.org>	Fix the port of the NetBSD 19990120-accept fix. I misread a piece of code when examining their fix, which caused my code (in rev 1.52) to: - panic("soaccept: !NOFDREF") - fatal trap 12, with tracebacks going thru soclose and soaccept
# d254af07	27-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
# 527b7a14	25-Jan-1999	Bill Fenner <fenner@FreeBSD.org>	Port NetBSD's 19990120-accept bug fix. This works around the race condition where select(2) can return that a listening socket has a connected socket queued, the connection is broken, and the user calls accept(2), which then blocks because there are no connections queued. Reviewed by: wollman Obtained from: NetBSD (ftp://ftp.NetBSD.ORG/pub/NetBSD/misc/security/patches/19990120-accept)
# 7b177710	20-Jan-1999	Bill Fenner <fenner@FreeBSD.org>	Also consider the space left in the socket buffer when deciding whether to set PRUS_MORETOCOME.
# b0acefa8	20-Jan-1999	Bill Fenner <fenner@FreeBSD.org>	Add a flag, passed to pru_send routines, PRUS_MORETOCOME. This flag means that there is more data to be put into the socket buffer. Use it in TCP to reduce the interaction between mbuf sizes and the Nagle algorithm. Based on: "Justin C. Walker" <justin@apple.com>'s description of Apple's fix for this problem.
# 219cbf59	09-Jan-1999	Eivind Eklund <eivind@FreeBSD.org>	KNFize, by bde.
# 5526d2d9	08-Jan-1999	Eivind Eklund <eivind@FreeBSD.org>	Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as discussed on -hackers. Introduce 'KASSERT(assertion, ("panic message", args))' for simple check + panic. Reviewed by: msmith
# f1d19042	07-Dec-1998	Archie Cobbs <archie@FreeBSD.org>	The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static and local variables, goto labels, and functions declared but not defined.
# 831d27a9	11-Nov-1998	Don Lewis <truckman@FreeBSD.org>	Installed the second patch attached to kern/7899 with some changes suggested by bde, a few other tweaks to get the patch to apply cleanly again and some improvements to the comments. This change closes some fairly minor security holes associated with F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN had on tty devices. For more details, see the description on the PR. Because this patch increases the size of the proc and pgrp structures, it is necessary to re-install the includes and recompile libkvm, the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w. PR: kern/7899 Reviewed by: bde, elvind
# 9898afa1	31-Aug-1998	Garrett Wollman <wollman@FreeBSD.org>	Bow to tradition and correctly implement the bogus-but-hallowed semantics of getsockopt never telling how much it might have copied if only the buffer were big enough.
# d224dbc1	31-Aug-1998	Garrett Wollman <wollman@FreeBSD.org>	Correctly set the return length regardless of the relative size of the user's buffer. Simplify the logic a bit. (Can we have a version of min() for size_t?)
# cfe8b629	22-Aug-1998	Garrett Wollman <wollman@FreeBSD.org>	Yow! Completely change the way socket options are handled, eliminating another specialized mbuf type in the process. Also clean up some of the cruft surrounding IPFW, multicast routing, RSVP, and other ill-explored corners.
# 0c495036	18-Jul-1998	Bill Fenner <fenner@FreeBSD.org>	Undo rev 1.41 until we get more details about why it makes some systems fail.
# dece5b6a	06-Jul-1998	Bill Fenner <fenner@FreeBSD.org>	Introduce (fairly hacky) workaround for odd TCP behavior with application writes of size (100,208]+N*MCLBYTES. The bug: sosend() hands each mbuf off to the protocol output routine as soon as it has copied it, in the hopes of increasing parallelism (see http://www.kohala.com/~rstevens/vanj.88jul20.txt ). This works well for TCP as long as the first mbuf handed off is at least the MSS. However, when doing small writes (between MHLEN and MINCLSIZE), the transaction is split into 2 small MBUF's and each is individually handed off to TCP. TCP assumes that the first small mbuf is the whole transaction, so sends a small packet. When the second small mbuf arrives, Nagle prevents TCP from sending it so it must wait for a (potentially delayed) ACK. This sends throughput down the toilet. The workaround: Set the "atomic" flag when we're doing small writes. The "atomic" flag has two meanings: 1. Copy all of the data into a chain of mbufs before handing off to the protocol. 2. Leave room for a datagram header in said mbuf chain. TCP wants the first but doesn't want the second. However, the second simply results in some memory wastage (but is why the workaround is a hack and not a fix). The real fix: The real fix for this problem is to introduce something like a "requested transfer size" variable in the socket->protocol interface. sosend() would then accumulate an mbuf chain until it exceeded the "requested transfer size". TCP could set it to the TCP MSS (note that the current interface causes strange TCP behaviors when the MSS > MCLBYTES; nobody notices because MCLBYTES > ethernet's MTU).
# 98271db4	15-May-1998	Garrett Wollman <wollman@FreeBSD.org>	Convert socket structures to be type-stable and add a version number. Define a parameter which indicates the maximum number of sockets in a system, and use this to size the zone allocators used for sockets and for certain PCBs. Convert PF_LOCAL PCB structures to be type-stable and add a version number. Define an external format for infomation about socket structures and use it in several places. Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through sysctl(3) without blocking network interrupts for an unreasonable length of time. This probably still has some bugs and/or race conditions, but it seems to work well enough on my machines. It is now possible for `netstat' to get almost all of its information via the sysctl(3) interface rather than reading kmem (changes to follow).
# 08637435	28-Mar-1998	Bruce Evans <bde@FreeBSD.org>	Moved some #includes from <sys/param.h> nearer to where they are actually used.
# 4049a042	01-Mar-1998	Guido van Rooij <guido@FreeBSD.org>	Make sure that you can only bind a more specific address when it is done by the same uid. Obtained from: OpenBSD
# 92f57d00	19-Feb-1998	Bill Fenner <fenner@FreeBSD.org>	Revert sosend() to its behavior from 4.3-Tahoe and before: if so_error is set, clear it before returning it. The behavior introduced in 4.3-Reno (to not clear so_error) causes potentially transient errors (e.g. ECONNREFUSED if the other end hasn't opened its socket yet) to be permanent on connected datagram sockets that are only used for writing. (soreceive() clears so_error before returning it, as does getsockopt(...,SO_ERROR,...).) Submitted by: Van Jacobson <van@ee.lbl.gov>, via a comment in the vat sources.
# 0b08f5f7	05-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Back out DIAGNOSTIC changes.
# 47cfdb16	04-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Turn DIAGNOSTIC into a new-style option.
# 64bd2f7b	08-Nov-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	MF22: MSG_EOR bug fix. Submitted by: wollman
# a1c995b6	12-Oct-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Last major round (Unless Bruce thinks of somthing :-) of malloc changes. Distribute all but the most fundamental malloc types. This time I also remembered the trick to making things static: Put "static" in front of them. A couple of finer points by: bde
# eabecea3	04-Oct-1997	Poul-Henning Kamp <phk@FreeBSD.org>	While booting diskless we have no proc pointer.
# e25aa68e	13-Sep-1997	Peter Wemm <peter@FreeBSD.org>	Extend select backend for sockets to work with a poll interface (more detail is passed back and forwards). This mostly came from NetBSD, except that our interfaces have changed a lot and this funciton is in a different part of the kernel. Obtained from: NetBSD
# e4ba6a82	02-Sep-1997	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# b1037dcd	21-Aug-1997	Bruce Evans <bde@FreeBSD.org>	#include <machine/limits.h> explicitly in the few places that it is required.
# 57bf258e	16-Aug-1997	Garrett Wollman <wollman@FreeBSD.org>	Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to not malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
# 006ad618	27-Jun-1997	Peter Wemm <peter@FreeBSD.org>	Don't accept insane values for SO_(SND\|RCV)BUF, and the low water marks. Specifically, don't allow a value < 1 for any of them (it doesn't make sense), and don't let the low water mark be greater than the corresponding high water mark. Pre-Approved by: wollman Obtained from: NetBSD
# a29f300e	27-Apr-1997	Garrett Wollman <wollman@FreeBSD.org>	The long-awaited mega-massive-network-code- cleanup. Part I. This commit includes the following changes: 1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility glue for them is deleted, and the kernel will panic on boot if any are compiled in. 2) Certain protocol entry points are modified to take a process structure, so they they can easily tell whether or not it is possible to sleep, and also to access credentials. 3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt() call. Protocols should use the process pointer they are now passed. 4) The PF_LOCAL and PF_ROUTE families have been updated to use the new style, as has the `raw' skeleton family. 5) PF_LOCAL sockets now obey the process's umask when creating a socket in the filesystem. As a result, LINT is now broken. I'm hoping that some enterprising hacker with a bit more time will either make the broken bits work (should be easy for netipx) or dike them out.
# 3ac4d1ef	22-Mar-1997	Bruce Evans <bde@FreeBSD.org>	Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined. Fixed everything that depended on getting fcntl.h stuff from the wrong place. Most things don't depend on file.h stuff at all.
# 639acc13	24-Feb-1997	Garrett Wollman <wollman@FreeBSD.org>	Create a new branch of the kernel MIB, kern.ipc, to store all of the configurables and instrumentation related to inter-process communication mechanisms. Some variables, like mbuf statistics, are instrumented here for the first time. For mbuf statistics: also keep track of m_copym() and m_pullup() failures, and provide for the user's inspection the compiled-in values of MSIZE, MHLEN, MCLBYTES, and MINCLSIZE.
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# add2e5d0	29-Nov-1996	David Greenman <dg@FreeBSD.org>	Check for error return from uiomove to prevent looping endlessly in soreceive(). Closes PR#2114. Submitted by: wpaul
# ebb0cbea	06-Oct-1996	Paul Traina <pst@FreeBSD.org>	Increase robustness of FreeBSD against high-rate connection attempt denial of service attacks. Reviewed by: bde,wollman,olah Inspired by: vjs@sgi.com
# 2c37256e	11-Jul-1996	Garrett Wollman <wollman@FreeBSD.org>	Modify the kernel to use the new pr_usrreqs interface rather than the old pr_usrreq mechanism which was poorly designed and error-prone. This commit renames pr_usrreq to pr_ousrreq so that old code which depended on it would break in an obvious manner. This commit also implements the new interface for TCP, although the old function is left as an example (#ifdef'ed out). This commit ALSO fixes a longstanding bug in the TCP timer processing (introduced by davidg on 1995/04/12) which caused timer processing on a TCB to always stop after a single timer had expired (because it misinterpreted the return value from tcp_usrreq() to indicate that the TCB had been deleted). Finally, some code related to polling has been deleted from if.c because it is not relevant t -current and doesn't look at all like my current code.
# 82dab6ce	09-May-1996	Garrett Wollman <wollman@FreeBSD.org>	Make it possible to return more than one piece of control information (PR #1178). Define a new SO_TIMESTAMP socket option for datagram sockets to return packet-arrival timestamps as control information (PR #1179). Submitted by: Louis Mamakos <loiue@TransSys.com>
# 46f578e7	15-Apr-1996	David Greenman <dg@FreeBSD.org>	Fix for PR #1146: the "next" pointer must be cached before calling soabort since the struct containing it may be freed.
# edbfedac	11-Mar-1996	Peter Wemm <peter@FreeBSD.org>	Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all files are off the vendor branch, so this should not change anything. A "U" marker generally means that the file was not changed in between the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally means that there was a change. [note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]
# be24e9e8	11-Mar-1996	David Greenman <dg@FreeBSD.org>	Changed socket code to use 4.4BSD queue macros. This includes removing the obsolete soqinsque and soqremque functions as well as collapsing so_q0len and so_qlen into a single queue length of unaccepted connections. Now the queue of unaccepted & complete connections is checked directly for queued sockets. The new code should be functionally equivilent to the old while being substantially faster - especially in cases where large numbers of connections are often queued for accept (e.g. http).
# dc915e7c	13-Feb-1996	Garrett Wollman <wollman@FreeBSD.org>	Kill XNS. While we're at it, fix socreate() to take a process argument. (This was supposed to get committed days ago...)
# b1358054	07-Feb-1996	Garrett Wollman <wollman@FreeBSD.org>	Define a new socket option, SO_PRIVSTATE. Getting it returns the state of the SS_PRIV flag in so_state; setting it always clears same.
# 47daf5d5	14-Dec-1995	Bruce Evans <bde@FreeBSD.org>	Nuked ambiguous sleep message strings: old: new: netcls[] = "netcls" "soclos" netcon[] = "netcon" "accept", "connec" netio[] = "netio" "sblock", "sbwait"
# ff5c09da	03-Nov-1995	Garrett Wollman <wollman@FreeBSD.org>	Make somaxconn (maximum backlog in a listen(2) request) and sb_max (maximum size of a socket buffer) tunable. Permit callers of listen(2) to specify a negative backlog, which is translated into somaxconn. Previously, a negative backlog was silently translated into 0.
# 5e319b84	25-Aug-1995	Bruce Evans <bde@FreeBSD.org>	Remove extra arg from one of the calls to (*pr_usrreq)().
# 9b2e5354	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
# 5f540404	15-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	getsockopt(s, SOL_SOCKET, SO_SNDTIMEO, ...) would construct the returned timeval incorrectly, truncating the usec part. Obtained from: Stevens vol. 2 p. 548
# 6b8fda4d	06-Feb-1995	Garrett Wollman <wollman@FreeBSD.org>	Merge in the socket-level support for Transaction TCP.
# a635d6c7	05-Feb-1995	David Greenman <dg@FreeBSD.org>	Use M_NOWAIT instead of M_KERNEL for socket allocations; it is apparantly possible for certain socket operations to occur during interrupt context. Submitted by: John Dyson
# 9f518539	02-Feb-1995	David Greenman <dg@FreeBSD.org>	Calling semantics for kmem_malloc() have been changed...and the third argument is now more than just a single flag. (kern_malloc.c) Used new M_KERNEL value for socket allocations that previous were "M_NOWAIT". Note that this will change when we clean up the M_ namespace mess. Submitted by: John Dyson
# 797f2d22	02-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	All of this is cosmetic. prototypes, #includes, printfs and so on. Makes GCC a lot more silent.
# 3c4dd356	02-Aug-1994	David Greenman <dg@FreeBSD.org>	Added $Id$
# 3962127e	29-May-1994	David Greenman <dg@FreeBSD.org>	Changed mbuf allocation policy to get a cluster if size > MINCLSIZE. Makes a BIG difference in socket performance.
# 26f9a767	25-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources