History log of /freebsd-current/sys/kern/uipc_syscalls.c
Revision	Date	Author	Comments
# aa32d7cb	06-Apr-2024	Jake Freeland <jfree@FreeBSD.org>	ktrace: Record socket violations with KTR_CAPFAIL Report restricted access to socket addresses and protocols while Capsicum violation tracing with CAPFAIL_ADDR and CAPFAIL_PROTO. Reviewed by: markj Approved by: markj (mentor) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D40681
# 0cd9cde7	06-Apr-2024	Jake Freeland <jfree@FreeBSD.org>	ktrace: Record namei violations with KTR_CAPFAIL Report namei path lookups while Capsicum violation tracing with CAPFAIL_NAMEI. vfs caching is also ignored when tracing to mimic capability mode behavior. Reviewed by: markj Approved by: markj (mentor) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D40680
# 47ad4f2d	04-Mar-2024	Kyle Evans <kevans@FreeBSD.org>	ktrace: log genio events on failed write Visibility into the contents of the buffer when a write(2) has failed can be immensely useful in debugging IPC issues -- pushing this to discuss the idea, or maybe an alternative where we can set a flag like KTRFAC_ERRIO to enable it. When a genio event is potentially raised after an error, currently we'll just free the uio and return. However, such data can be useful when debugging communication between processes to, e.g., understand what the remote side should have grabbed before closing a pipe. Tap out the entire buffer on failure rather than simply discarding it. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D43799
# f79a8585	30-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: garbage collect SS_ISCONFIRMING Fixes: 8df32b19dee92b5eaa4b488ae78dca6accfcb38e
# c3276e02	16-Jan-2024	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: make shutdown(2) how argument an enum Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D43412
# 0fac350c	30-Nov-2023	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: don't malloc/free sockaddr memory on getpeername/getsockname Just like it was done for accept(2) in cfb1e92912b4, use same approach for two simpler syscalls that return socket addresses. Although, these two syscalls aren't performance critical, this change generalizes some code between 3 syscalls trimming code size. Following example of accept(2), provide VNET-aware and INVARIANT-checking wrappers sopeeraddr() and sosockaddr() around protosw methods. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D42694
# cfb1e929	30-Nov-2023	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: don't malloc/free sockaddr memory on accept(2) Let the accept functions provide stack memory for protocols to fill it in. Generic code should provide sockaddr_storage, specialized code may provide smaller structure. While rewriting accept(2) make 'addrlen' a true in/out parameter, reporting required length in case if provided length was insufficient. Our manual page accept(2) and POSIX don't explicitly require that, but one can read the text as they do. Linux also does that. Update tests accordingly. Reviewed by: rscheff, tuexen, zlei, dchagin Differential Revision: https://reviews.freebsd.org/D42635
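The in/out behaviour of 'addrlen' described in cfb1e929 can be observed from userspace. The following is a minimal sketch, not code from the commit: the helper name `accept_reported_addrlen()` is hypothetical, and it assumes a loopback interface is available. On kernels that predate this change the reported length may simply equal the truncated length that was passed in.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: accept a loopback TCP connection while offering
 * only 'provided' bytes of address space, and return the length that
 * accept(2) writes back into addrlen.  Returns -1 on any failure.
 */
int
accept_reported_addrlen(socklen_t provided)
{
	struct sockaddr_in sin;
	struct sockaddr_storage peer;
	socklen_t len;
	int lfd, cfd, afd, result;

	result = -1;
	if (provided > sizeof(peer))
		return (-1);
	lfd = socket(AF_INET, SOCK_STREAM, 0);
	cfd = socket(AF_INET, SOCK_STREAM, 0);
	if (lfd == -1 || cfd == -1)
		goto out;

	/* Bind to an ephemeral port on 127.0.0.1 and start listening. */
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (bind(lfd, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(lfd, 1) == -1)
		goto out;
	len = sizeof(sin);
	if (getsockname(lfd, (struct sockaddr *)&sin, &len) == -1)
		goto out;
	if (connect(cfd, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		goto out;

	/* Offer less space than a sockaddr_in needs. */
	len = provided;
	afd = accept(lfd, (struct sockaddr *)&peer, &len);
	if (afd != -1) {
		result = (int)len;
		close(afd);
	}
out:
	if (lfd != -1)
		close(lfd);
	if (cfd != -1)
		close(cfd);
	return (result);
}
```

With the behaviour the commit message describes, `accept_reported_addrlen(4)` returns `sizeof(struct sockaddr_in)` rather than 4, which tells the caller how much space to provide on a retry.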
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixups to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# 761ae1ce	16-Oct-2023	Mark Johnston <markj@FreeBSD.org>	ktrace: Handle uio_resid underflow via MSG_TRUNC When recvmsg(2) is used with MSG_TRUNC on an atomic socket type (DGRAM or SEQPACKET), soreceive_generic() and uipc_peek_dgram() may intentionally underflow uio_resid so that userspace can find out how many bytes it should have asked for. If this happens, and KTR_GENIO is enabled, ktrgenio() will attempt to copy in beyond the end of the output buffer's iovec. In general this will silently cause the ktrace operation to fail since it'll result in EFAULT from uiomove(). Let's be more careful and make sure not to try and copy more bytes than we have. Fixes: be1f485d7d6b ("sockets: add MSG_TRUNC flag handling for recvfrom()/recvmsg().") Reported by: syzbot+30b4bb0c0bc0f53ac198@syzkaller.appspotmail.com Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D42099
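The userspace side of the MSG_TRUNC behaviour that 761ae1ce refers to can be illustrated as follows. This is a sketch under stated assumptions, not kernel code: the helper name `recv_truncated_length()` is hypothetical, and it assumes a kernel that honours MSG_TRUNC as an input flag on local datagram sockets.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: send a datagram of payload_len bytes over a local
 * socket pair and receive it with MSG_TRUNC into a buffer of only
 * buf_len bytes.  Returns what recv(2) reports, or -1 on error.
 * Both lengths are limited to 256 bytes to keep the sketch simple.
 */
ssize_t
recv_truncated_length(size_t payload_len, size_t buf_len)
{
	char payload[256], buf[256];
	int sv[2];
	ssize_t n;

	if (payload_len > sizeof(payload) || buf_len > sizeof(buf))
		return (-1);
	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
		return (-1);
	memset(payload, 'x', payload_len);
	n = -1;
	if (send(sv[0], payload, payload_len, 0) == (ssize_t)payload_len) {
		/* With MSG_TRUNC the real datagram size is reported. */
		n = recv(sv[1], buf, buf_len, MSG_TRUNC);
	}
	close(sv[0]);
	close(sv[1]);
	return (n);
}
```

When the datagram is larger than the buffer, the return value exceeds `buf_len`. That is the case in which, per the commit message, the kernel's residual count goes negative and ktrgenio() must not copy more bytes than the buffer holds.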
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
# 6016aedb	12-Jun-2023	Dmitriy Alexandrov <d06alexandrov@users.noreply.github.com>	uipc_syscalls: removed unnecessary check in accept1() function Signed-off-by: Dmitriy Alexandrov <d06alexandrov@gmail.com> Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/773
# 00343b4a	13-Feb-2023	Mateusz Guzik <mjg@FreeBSD.org>	uipc: ansify Sponsored by: Rubicon Communications, LLC ("Netgate")
# 7a2c93b8	14-Dec-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: provide sousrsend() that does socket specific error handling Sockets have special handling for EPIPE on a write, that was spread out into several places. Treating transient errors is also special - if protocol is atomic, then we should ignore any changes to uio_resid, a transient error means the write had completely failed (see d2b3a0ed31e). - Provide sousrsend() that expects a valid uio, and leave sosend() for kernel consumers only. Do all special error handling right here. - In dofilewrite() don't do special handling of error for DTYPE_SOCKET. - For send(2), write(2) and aio_write(2) call into sousrsend() and remove error handling for kern_sendit(), soo_write() and soaio_process_job(). PR: 265087 Reported by: rz-rpi03 at h-ka.de Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35863
# 1760a695	10-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	Fixup build after recent getsock changes
# 3be2225f	10-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	Remove fflag argument from getsock_cap Interested callers can obtain it on their own easily enough and there is no reason to branch on it.
# 3212ad15	07-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	Add getsock All but one consumer of getsock_cap only pass 4 arguments. Take advantage of it.
# a2ad7092	10-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	Add branch prediction hints to getsock_cap
# e7d02be1	17-Aug-2022	Gleb Smirnoff <glebius@FreeBSD.org>	protosw: refactor protosw and domain static declaration and load o Assert that every protosw has pr_attach. Now this structure is only for socket protocols declarations and nothing else. o Merge struct pr_usrreqs into struct protosw. This was suggested in 1996 by wollman@ (see 7b187005d18ef), and later reiterated in 2006 by rwatson@ (see 6fbb9cf860dcd). o Make struct domain hold a variable sized array of protosw pointers. For most protocols these pointers are initialized statically. Those domains that may have loadable protocols have spacers. IPv4 and IPv6 have 8 spacers each (andre@ dff3237ee54ea). o For inetsw and inet6sw leave a comment noting that many protosw entries very likely are dead code. o Refactor pf_proto_[un]register() into protosw_[un]register(). o Isolate pr_*_notsupp() methods into uipc_domain.c Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36232
# 31d1b816	28-May-2022	Dmitry Chagin <dchagin@FreeBSD.org>	sysent: Get rid of bogus sys/sysent.h include. Where appropriate hide sysent.h under proper condition. MFC after: 2 weeks
# d60ea9a1	25-May-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sockets: return EMSGSIZE if control part of message is too large Specification doesn't list an explicit error code for the control size specified by msg_control being too large. But it does list EMSGSIZE as error code for "message is too large to be sent all at once (as the socket requires)". It also lists EINVAL as code for the "The sum of the iov_len values overflows an ssize_t." Given how generic and uninformative EINVAL is, the EMSGSIZE is more appropriate. https://pubs.opengroup.org/onlinepubs/9699919799/functions/sendmsg.html Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35316
# d2b3a0ed	17-Feb-2022	Gleb Smirnoff <glebius@FreeBSD.org>	sendto: don't clear transient errors for atomic protocols The changeset 65572cade35 uncovered the fact that top layer of sendto(2) would clear a transient error code if some data was copied out of uio. The clearing of the error makes sense for non-atomic protocols, since they have sent some data. The atomic protocols send all or nothing. The current implementation of unix/dgram uses sosend_generic(), which would always copyout and only then it may fail to deliver a message. The sosend_dgram(), currently used by UDP only, also has same behavior. Reported by: pho Reviewed by: pho, markj Differential revision: https://reviews.freebsd.org/D34309
# 308fc7e5	24-Jan-2022	John Baldwin <jhb@FreeBSD.org>	user_getpeername: Use 'bool' for the compat argument. This matches user_getsockname. Reviewed by: brooks, kib Sponsored by: The University of Cambridge, Google Inc. Differential Revision: https://reviews.freebsd.org/D33987
# ba4e5253	29-Nov-2021	Brooks Davis <brooks@FreeBSD.org>	syscalls: normalize orecvfrom and ogetsockname Declare o<foo>_args rather than reusing the equivalent <foo>_args structs. Avoiding the addition of a new type isn't worth the gratuitous differences. Reviewed by: kib, imp
# 28f04718	29-Nov-2021	Brooks Davis <brooks@FreeBSD.org>	uipc: rework recvfrom, getsockname, getpeername Stop using <foo>_args structs as part of internal kernel APIs. Add a kern_recvfrom and adjust getsockname and getpeername's equivalent functions to take individual arguments rather than a uap pointer. Adopt a convention from CheriBSD that a function interacting with userspace pointers and sitting between the sys_<foo> syscall and kern_<foo> implementation is named user_<foo>. Reviewed by: kib, imp
# a8aa6f1f	07-Sep-2021	Mark Johnston <markj@FreeBSD.org>	socket: Avoid clearing SS_ISCONNECTING if soconnect() fails This behaviour appears to date from the 4.4 BSD import. It has two problems: 1. The update to so_state is not protected by the socket lock, so concurrent updates to so_state may be lost. 2. Suppose two threads race to call connect(2) on a socket, and one succeeds while the other fails. Then the failing thread may incorrectly clear SS_ISCONNECTING, confusing the state machine. Simply remove the update. It does not appear to be necessary: pru_connect implementations which call soisconnecting() only do so after all failure modes have been handled. For instance, tcp_connect() and tcp6_connect() will never return an error after calling soisconnected(). However, we cannot correctly assert that SS_ISCONNECTED is not set after an error from soconnect() since the socket lock is not held across the pru_connect call, so a concurrent connect(2) may have set the flag. MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31699
# 091869de	27-Aug-2021	Mark Johnston <markj@FreeBSD.org>	connect: Use soconnectat() unconditionally in kern_connect() soconnect(...) is equivalent to soconnectat(AT_FDCWD, ...), so rely on this to save a branch. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation
# f4bb1869	14-Jun-2021	Mark Johnston <markj@FreeBSD.org>	Consistently use the SOLISTENING() macro Some code was using it already, but in many places we were testing SO_ACCEPTCONN directly. As a small step towards fixing some bugs involving synchronization with listen(2), make the kernel consistently use SOLISTENING(). No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation
# f187d6df	15-Mar-2021	Kyle Evans <kevans@FreeBSD.org>	base: remove if_wg(4) and associated utilities, manpage After length decisions, we've decided that the if_wg(4) driver and related work is not yet ready to live in the tree. This driver has larger security implications than many, and thus will be held to more scrutiny than other drivers. Please also see the related message sent to the freebsd-hackers@ and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on 2021/03/16, with the subject line "Removing WireGuard Support From Base" for additional context.
# 74ae3f3e	14-Mar-2021	Kyle Evans <kevans@FreeBSD.org>	if_wg: import latest fixup work from the wireguard-freebsd project This is the culmination of about a week of work from three developers to fix a number of functional and security issues. This patch consists of work done by the following folks: - Jason A. Donenfeld <Jason@zx2c4.com> - Matt Dunwoodie <ncon@noconroy.net> - Kyle Evans <kevans@FreeBSD.org> Notable changes include: - Packets are now correctly staged for processing once the handshake has completed, resulting in less packet loss in the interim. - Various race conditions have been resolved, particularly w.r.t. socket and packet lifetime (panics) - Various tests have been added to assure correct functionality and tooling conformance - Many security issues have been addressed - if_wg now maintains jail-friendly semantics: sockets are created in the interface's home vnet so that it can act as the sole network connection for a jail - if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0 - if_wg now exports via ioctl a format that is future proof and complete. It is additionally supported by the upstream wireguard-tools (which we plan to merge in to base soon) - if_wg now conforms to the WireGuard protocol and is more closely aligned with security auditing guidelines Note that the driver has been rebased away from using iflib. iflib poses a number of challenges for a cloned device trying to operate in a vnet that are non-trivial to solve and adds complexity to the implementation for little gain. The crypto implementation that was previously added to the tree was a super complex integration of what previously appeared in an old out of tree Linux module, which has been reduced to crypto.c containing simple boring reference implementations. This is part of a near-to-mid term goal to work with FreeBSD kernel crypto folks and take advantage of or improve accelerated crypto already offered elsewhere. 
There's additional test suite effort underway out-of-tree taking advantage of the aforementioned jail-friendly semantics to test a number of real-world topologies, based on netns.sh. Also note that this is still a work in progress; work going further will be much smaller in nature. MFC after: 1 month (maybe)
# 7e097daa	11-Aug-2019	Konstantin Belousov <kib@FreeBSD.org>	Only enable COMPAT_43 changes for syscalls ABI for a.out processes. Reviewed by: imp, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21200
# 401ca034	05-Feb-2019	Mark Johnston <markj@FreeBSD.org>	Avoid leaking fp references when truncating SCM_RIGHTS control messages. Reported by: pho Approved by: so MFC after: 0 minutes Security: CVE-2019-5596 Sponsored by: The FreeBSD Foundation
# d48719bd	04-Dec-2018	Brooks Davis <brooks@FreeBSD.org>	Normalize COMPAT_43 syscall declarations. Have ogetkerninfo, ogetpagesize, ogethostname, osethostname, and oaccept declare o<foo>_args structs rather than non-compat ones. Due to a failure to use NOARGS in most cases this adds only one new declaration. No changes required in freebsd32 as only ogetpagesize() is implemented and it has a 32-bit specific implementation. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15816
# 318f0d77	06-Nov-2018	Brooks Davis <brooks@FreeBSD.org>	Use declared types for caddr_t arguments. Leave ptrace(2) alone for the moment as it's defined to take a caddr_t. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17852
# c7902fbe	07-Aug-2018	Mark Johnston <markj@FreeBSD.org>	Improve handling of control message truncation. If a recvmsg(2) or recvmmsg(2) caller doesn't provide sufficient space for all control messages, the kernel sets MSG_CTRUNC in the message flags to indicate truncation of the control messages. In the case of SCM_RIGHTS messages, however, we were failing to dispose of the rights that had already been externalized into the recipient's file descriptor table. Add a new function and mbuf type to handle this cleanup task, and use it any time we fail to copy control messages out to the recipient. To simplify cleanup, control message truncation is now only performed at control message boundaries. The change also fixes a few related bugs: - Rights could be leaked to the recipient process if an error occurred while copying out a message's contents. - We failed to set MSG_CTRUNC if the truncation occurred on a control message boundary, e.g., if the caller received two control messages and provided only the exact amount of buffer space needed for the first. PR: 131876 Reviewed by: ed (previous version) MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16561
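The MSG_CTRUNC signalling described in c7902fbe can be observed with a short userspace test. This is a minimal sketch with a hypothetical helper name, `rights_truncation_sets_ctrunc()`. It only checks the flag that the receiver sees and does not verify the kernel-side cleanup of the externalized descriptors.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: pass one descriptor over a local socket pair and
 * receive it with a control buffer that has room for a cmsghdr but not
 * for the descriptor.  Returns 1 if MSG_CTRUNC was set on receive, 0 if
 * it was not, and -1 on error.
 */
int
rights_truncation_sets_ctrunc(void)
{
	union {
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} ctl;
	struct cmsghdr small, *cmsg;
	struct msghdr msg;
	struct iovec iov;
	char byte;
	int sv[2], result;

	result = -1;
	byte = 'x';
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return (-1);

	/* Sender: one data byte plus one descriptor (sv[0] itself). */
	memset(&msg, 0, sizeof(msg));
	memset(&ctl, 0, sizeof(ctl));
	iov.iov_base = &byte;
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &sv[0], sizeof(int));

	if (sendmsg(sv[0], &msg, 0) == 1) {
		/* Receiver: control space too small for the rights message. */
		memset(&msg, 0, sizeof(msg));
		memset(&small, 0, sizeof(small));
		msg.msg_iov = &iov;
		msg.msg_iovlen = 1;
		msg.msg_control = &small;
		msg.msg_controllen = sizeof(small);
		if (recvmsg(sv[1], &msg, 0) == 1)
			result = (msg.msg_flags & MSG_CTRUNC) != 0;
	}
	close(sv[0]);
	close(sv[1]);
	return (result);
}
```

A receiver that sees MSG_CTRUNC should assume that some control messages were dropped and retry with a larger buffer where the protocol allows it.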
# da446550	02-Aug-2018	Alan Somers <asomers@FreeBSD.org>	Fix LOCAL_PEERCRED with socketpair(2) Enable the LOCAL_PEERCRED socket option for unix domain stream sockets created with socketpair(2). Previously, it only worked with unix domain stream sockets created with socket(2)/listen(2)/connect(2)/accept(2). PR: 176419 Reported by: Nicholas Wilson <nicholas@nicholaswilson.me.uk> MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D16350
# 8a656309	22-May-2018	Matt Macy <mmacy@FreeBSD.org>	kern_sendit: use pre-initialized rights
# cbd92ce6	09-May-2018	Matt Macy <mmacy@FreeBSD.org>	Eliminate the overhead of gratuitous repeated reinitialization of cap_rights - Add macros to allow preinitialization of cap_rights_t. - Convert most commonly used code paths to use preinitialized cap_rights_t. A 3.6% speedup in fstat was measured with this change. Reported by: mjg Reviewed by: oshogbo Approved by: sbruno MFC after: 1 month
# 2216c693	30-Apr-2018	Ed Maste <emaste@FreeBSD.org>	Disable connectat/bindat with AT_FDCWD in capmode Previously it was possible to connect a socket (which had the CAP_CONNECT right) by calling "connectat(AT_FDCWD, ...)" even in capabilities mode. This combination should be treated the same as a call to connect (i.e. forbidden in capabilities mode). Similarly for bindat. Disable connectat/bindat with AT_FDCWD in capabilities mode, fix up the documentation and add tests. PR: 222632 Submitted by: Jan Kokemüller <jan.kokemueller@gmail.com> Reviewed by: Domagoj Stolfa MFC after: 1 week Relnotes: Yes Differential Revision: https://reviews.freebsd.org/D15221
# 6469bdcd	06-Apr-2018	Brooks Davis <brooks@FreeBSD.org>	Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compatibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941
# 34a77b97	27-Mar-2018	Brooks Davis <brooks@FreeBSD.org>	Move uio enums to sys/_uio.h. Include _uio.h instead of uio.h in several headers to reduce header pollution. Fix a few places that relied on header pollution to get the uio.h header. I have not moved struct uio as many more things that use it rely on header pollution to get other definitions from uio.h. Reviewed by: cem, kib, markj Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14811
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# 779f106a	08-Jun-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Listening sockets improvements. o Separate fields of struct socket that belong to listening from fields that belong to normal dataflow, and unionize them. This shrinks the structure a bit. - Take out selinfo's from the socket buffers into the socket. The first reason is to support braindamaged scenario when a socket is added to kevent(2) and then listen(2) is cast on it. The second reason is that there is future plan to make socket buffers pluggable, so that for a dataflow socket a socket buffer can be changed, and in this case we also want to keep same selinfos through the lifetime of a socket. - Remove struct struct so_accf. Since now listening stuff no longer affects struct socket size, just move its fields into listening part of the union. - Provide sol_upcall field and enforce that so_upcall_set() may be called only on a dataflow socket, which has buffers, and for listening sockets provide solisten_upcall_set(). o Remove ACCEPT_LOCK() global. - Add a mutex to socket, to be used instead of socket buffer lock to lock fields of struct socket that don't belong to a socket buffer. - Allow to acquire two socket locks, but the first one must belong to a listening socket. - Make soref()/sorele() to use atomic(9). This allows in some situations to do soref() without owning socket lock. There is place for improvement here, it is possible to make sorele() also to lock optionally. - Most protocols aren't touched by this change, except UNIX local sockets. See below for more information. o Reduce copy-and-paste in kernel modules that accept connections from listening sockets: provide function solisten_dequeue(), and use it in the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4), infiniband, rpc. o UNIX local sockets. - Removal of ACCEPT_LOCK() global uncovered several races in the UNIX local sockets. Most races exist around spawning a new socket, when we are connecting to a local listening socket. 
To cover them, we need to hold locks on both PCBs when spawning a third one. This means holding them across sonewconn(). This creates a LOR between pcb locks and unp_list_lock. - To fix the new LOR, abandon the global unp_list_lock in favor of global unp_link_lock. Indeed, separating these two locks didn't provide us any extra parallelism in the UNIX sockets. - Now call into uipc_attach() may happen with unp_link_lock held if we are accepting, or without unp_link_lock in case if we are just creating a socket. - Another problem in UNIX sockets is that uipc_close() basically did nothing for a listening socket. The vnode remained opened for connections. This is fixed by removing vnode in uipc_close(). Maybe the right way would be to do it for all sockets (not only listening), simply move the vnode teardown from uipc_detach() to uipc_close()? Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D9770
# d293f35c	29-Jan-2017	Edward Tomasz Napierala <trasz@FreeBSD.org>	Add kern_listen(), kern_shutdown(), and kern_socket(), and use them instead of their sys_*() counterparts in various compats. The svr4 is left untouched, because there's no point. Reviewed by: ed@, kib@ MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9367
# 9d71a397	21-Oct-2016	Hiren Panchasara <hiren@FreeBSD.org>	Rework r306337. In sendit(), if mp->msg_control is present, then in sockargs() we are allocating mbuf to store mp->msg_control. Later in kern_sendit(), call to getsock_cap(), will check validity of file pointer passed, if this fails EBADF is returned but mbuf allocated in sockargs() is not freed. Made code changes to free the same. Since freeing control mbuf in sendit() after checking (control != NULL) may lead to double freeing of control mbuf in sendit(), we can free control mbuf in kern_sendit() if there are any errors in the routine. Submitted by: Lohith Bellad <lohith.bellad@me.com> Reviewed by: glebius MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D8152
# 7c9a4d09	26-Sep-2016	Hiren Panchasara <hiren@FreeBSD.org>	Revert r306337. dhw@ reported a panic which seems related to this and bde@ has raised some issues.
# 41bb1a25	26-Sep-2016	Hiren Panchasara <hiren@FreeBSD.org>	In sendit(), if mp->msg_control is present, then in sockargs() we are allocating mbuf to store mp->msg_control. Later in kern_sendit(), call to getsock_cap(), will check validity of file pointer passed, if this fails EBADF is returned but mbuf allocated in sockargs() is not freed. Fix this possible leak. Submitted by: Lohith Bellad <lohith.bellad@me.com> Reviewed by: adrian MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D7910
# 85b0f9de	22-Sep-2016	Mariusz Zaborski <oshogbo@FreeBSD.org>	capsicum: propagate rights on accept(2) Descriptor returned by accept(2) should inherit capability rights from the listening socket. PR: 201052 Reviewed by: emaste, jonathan Discussed with: many Differential Revision: https://reviews.freebsd.org/D7724
# 69a28758	15-Sep-2016	Ed Maste <emaste@FreeBSD.org>	Renumber license clauses in sys/kern to avoid skipping #3
# 2e4fd101	10-Sep-2016	Konstantin Belousov <kib@FreeBSD.org>	Fix build
# 82b3cec5	09-Sep-2016	Ed Maste <emaste@FreeBSD.org>	ANSIfy uipc_syscalls.c Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D7839
# e9877429	18-May-2016	Gleb Smirnoff <glebius@FreeBSD.org>	The SA-16:19 wouldn't have happened if the sockargs() had properly typed argument for length. While here make it static and convert to ANSI C. Reviewed by: C Turt
# 7349ea78	17-May-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Validate that user supplied control message length is not negative. Submitted by: C Turt <cturt hardenedbsd.org> Security: SA-16:19 Security: CVE-2016-1887
# b85f65af	15-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	kern: for pointers replace 0 with NULL. These are mostly cosmetical, no functional change. Found with devel/coccinelle.
# 33a2a37b	21-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	- Separate sendfile(2) implementation from uipc_syscalls.c into separate file. Claim my copyright. - Provide more comments, better function and structure names. - Sort out unneeded includes from resulting two files. No functional changes.
# 2bab0c55	08-Jan-2016	Gleb Smirnoff <glebius@FreeBSD.org>	New sendfile(2) syscall. A joint effort of NGINX and Netflix from 2013 and up to now. The new sendfile is the code that Netflix uses to send their multiple tens of gigabits of data per second. The new implementation features asynchronous I/O, when I/O operations are launched, but not awaited to be complete. An explanation of why such behavior is beneficial compared to old one is going to be too long for a commit message, so we will skip it here. Additional features of new syscall are extra flags, which provide an application more control over data sent. The SF_NOCACHE flag tells kernel that data shouldn't be cached after it was sent. The SF_READAHEAD() macro allows to specify readahead size in pages. The new syscall is a drop-in replacement. No modifications are required to applications. One can take nginx binary for stable/10 and run it successfully on head. Although SF_NODISKIO lost its original sense, as now sendfile doesn't block, and now means something completely different (tm), using the new sendfile the old way is absolutely safe. Celebrates: Netflix global launch! Sponsored by: Nginx, Inc. Sponsored by: Netflix Relnotes: yes
# b0cd2017	16-Dec-2015	Gleb Smirnoff <glebius@FreeBSD.org>	A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES(). o With new KPI consumers can request contiguous ranges of pages, and unlike before, all pages will be kept busied on return, like it was done before with the 'reqpage' only. Now the reqpage goes away. With new interface it is easier to implement code protected from race conditions. Such arrayed requests for now should be preceded by a call to vm_pager_haspage() to make sure that request is possible. This could be improved later, making vm_pager_haspage() obsolete. Strengthening the promises on the business of the array of pages allows us to remove such hacks as swp_pager_free_nrpage() and vm_pager_free_nonreq(). o New KPI accepts two integer pointers that may optionally point at values for read ahead and read behind, that a pager may do, if it can. These pages are completely owned by pager, and not controlled by the caller. This shifts the UFS-specific readahead logic from vm_fault.c, which should be file system agnostic, into vnode_pager.c. It also removes one VOP_BMAP() request per hard fault. Discussed with: kib, alc, jeff, scottl Sponsored by: Nginx, Inc. Sponsored by: Netflix
# b114aa79	27-Jul-2015	Ed Schouten <ed@FreeBSD.org>	Make shutdown() return ENOTCONN as required by POSIX, part deux. Summary: Back in 2005, maxim@ attempted to fix shutdown() to return ENOTCONN in case the socket was not connected (r150152). This had to be rolled back (r150155), as it broke some of the existing programs that depend on this behavior. I reapplied this change on my system and indeed, syslogd failed to start up. I fixed this back in February (279016) and MFC'ed it to the supported stable branches. Apart from that, things seem to work out all right. Since at least Linux and Mac OS X do the right thing, I'd like to go ahead and give this another try. To keep old copies of syslogd working, only start returning ENOTCONN for recent binaries. I took a look at the XNU sources and they seem to test against both SS_ISCONNECTED, SS_ISCONNECTING and SS_ISDISCONNECTING, instead of just SS_ISCONNECTED. That seems reasonable, so let's do the same. Test Plan: This issue was uncovered while writing tests for shutdown() in CloudABI: https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/sys/socket/shutdown_test.c#L26 Reviewers: glebius, rwatson, #manpages, gnn, #network Reviewed By: gnn, #network Subscribers: bms, mjg, imp Differential Revision: https://reviews.freebsd.org/D3039
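The ENOTCONN behaviour that b114aa79 enables can be checked from userspace with a few lines. This is a minimal sketch: the helper name `shutdown_unconnected_errno()` is hypothetical, and binaries built against older ABI versions may, per the commit message, still see success instead.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: call shutdown(2) on a freshly created TCP socket
 * that was never connected.  Returns the errno value reported by
 * shutdown(2), 0 if it unexpectedly succeeded, or -1 if the socket
 * could not be created.
 */
int
shutdown_unconnected_errno(void)
{
	int s, error;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);
	error = (shutdown(s, SHUT_RDWR) == -1) ? errno : 0;
	close(s);
	return (error);
}
```

Programs that call shutdown() defensively on sockets of unknown state, as syslogd once did, need to tolerate this error rather than treat it as fatal.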
# 093c7f39	12-Jun-2015	Gleb Smirnoff <glebius@FreeBSD.org>	Make KPI of vm_pager_get_pages() more strict: if a pager changes a page in the requested array, then it is responsible for disposition of previous page and is responsible for updating the entry in the requested array. Now consumers of KPI do not need to re-lookup the pages after call to vm_pager_get_pages(). Reviewed by: kib Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 25742185	11-Apr-2015	Mateusz Guzik <mjg@FreeBSD.org>	Replace struct filedesc argument in getsock_cap with struct thread This is a step towards removal of spurious arguments.
# 90f54cbf	11-Apr-2015	Mateusz Guzik <mjg@FreeBSD.org>	fd: remove filedesc argument from fdclose Just accept a thread instead. This makes it consistent with fdalloc. No functional changes.
# 3d597295	28-Feb-2015	Ryan Stone <rstone@FreeBSD.org>	Correct the use of an uninitialized variable in sendfile_getobj() When sendfile_getobj() is called on a DTYPE_SHM file, it never initializes error, which is eventually returned to the caller. Differential Revision: https://reviews.freebsd.org/D1989 Reviewed by: kib Reported by: Brainy Code Scanner, by Maxime Villard.
# b7a39e9e	17-Feb-2015	Mateusz Guzik <mjg@FreeBSD.org>	filedesc: simplify fget_unlocked & friends Introduce fget_fcntl which performs appropriate checks when needed. This removes a branch from fget_unlocked. Introduce fget_mmap dealing with cap_rights_to_vmprot conversion. This removes a branch from _fget. Modify fget_unlocked to pass sequence counter to interested callers so that they can perform their own checks and make sure the result was obtained from stable & current state. Reviewed by: silence on -hackers
# 6e646651	13-Nov-2014	Konstantin Belousov <kib@FreeBSD.org>	Remove the no-at variants of the kern_xx() syscall helpers. E.g., we have both kern_open() and kern_openat(); change the callers to use kern_openat(). This removes one (sometimes two) levels of indirection and consolidates arguments checks. Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week
# efe28398	11-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Fix build.
# 0e87b36e	11-Nov-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Remove SF_KQUEUE code. This code was developed at Netflix, but was not ever used. It didn't go into stable/10, neither was documented. It might be useful, but we collectively decided to remove it, rather leave it abandoned and unmaintained. It is removed in one single commit, so restoring it should be easy, if anyone wants to reopen this idea. Sponsored by: Netflix
# 80b47aef	09-Oct-2014	Marcel Moolenaar <marcel@FreeBSD.org>	Move the SCTP syscalls to netinet with the rest of the SCTP code. The syscalls themselves are tightly coupled with the network stack and therefore should not be in the generic socket code. The following four syscalls have been marked as NOSTD so they can be dynamically registered in sctp_syscalls_init() function: sys_sctp_peeloff sys_sctp_generic_sendmsg sys_sctp_generic_sendmsg_iov sys_sctp_generic_recvmsg The syscalls are also set up to be dynamically registered when COMPAT32 option is configured. As a side effect of moving the SCTP syscalls, getsock_cap needs to be made available outside of the uipc_syscalls.c source file. A proper prototype has been added to the sys/socketvar.h header file. API tests from the SCTP reference implementation have been run to ensure compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout) Submitted by: Steve Kiernan <stevek@juniper.net> Reviewed by: tuexen, rrs Obtained from: Juniper Networks, Inc.
# 818d40d0	10-Aug-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Provide sf_buf_ref() to optimize refcounting of already allocated sendfile(2) buffers. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 1fbe6a82	11-Jul-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Improve reference counting of EXT_SFBUF pages attached to mbufs. o Do not use UMA refcount zone. The problem with this zone is that several refcounting words (16 on amd64) share the same cache line, and issuing atomic(9) updates on them creates cache line contention. Also, allocating and freeing them is extra CPU cycles. Instead, refcount the page directly via vm_page_wire() and the sfbuf via sf_buf_alloc(sf_buf_page(sf)) [1]. o Call refcounting/freeing function for EXT_SFBUF via direct function call, instead of function pointer. This removes barrier for CPU branch predictor. o Do not cleanup the mbuf to be freed in mb_free_ext(), merely to satisfy assertion in mb_dtor_mbuf(). Remove the assertion from mb_dtor_mbuf(). Use bcopy() instead of manual assignments to copy m_ext in mb_dupcl(). [1] This has some problems for now. Using sf_buf_alloc() merely to increase refcount is expensive, and is broken on sparc64. To be fixed. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 15c28f87	11-Jul-2014	Gleb Smirnoff <glebius@FreeBSD.org>	All mbuf external free functions never fail, so let them be void. Sponsored by: Nginx, Inc.
# 3ae10f74	16-Jun-2014	Attilio Rao <attilio@FreeBSD.org>	- Modify vm_page_unwire() and vm_page_enqueue() to directly accept the queue where to enqueue pages that are going to be unwired. - Add stronger checks to the enqueue/dequeue for the pagequeues when adding and removing pages to them. Of course, for unmanaged pages the queue parameter of vm_page_unwire() will be ignored, just as the active parameter today. This makes adding new pagequeues quicker. This change effectively modifies the KPI. __FreeBSD_version will be, however, bumped just when the full cache of free pages will be evicted. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho
# 857ce8a2	11-May-2014	Jilles Tjoelker <jilles@FreeBSD.org>	accept(),accept4(): Don't set addrlen = 0 on [ECONNABORTED]. If the underlying protocol reported an error (e.g. because a connection was closed while waiting in the queue), this error was also indicated by returning a zero-length address. For all other kinds of errors (e.g. [EAGAIN], [ENFILE], [EMFILE]), addrlen is unmodified and there are successful cases where a zero-length address is returned (e.g. a connection from an unbound Unix-domain socket), so this error indication is not reliable. As reported in Austin Group bug #836, modifying addrlen on error may cause subtle bugs if applications retry the call without resetting addrlen.
# 4a144410	16-Mar-2014	Robert Watson <rwatson@FreeBSD.org>	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks
# 0cfea1c8	16-Jan-2014	Adrian Chadd <adrian@FreeBSD.org>	Implement a kqueue notification path for sendfile. This fires off a kqueue note (of type sendfile) to the configured kqfd when the sendfile transaction has completed and the relevant memory backing the transaction is no longer in use by this transaction. This is analogous to SF_SYNC waiting for the mbufs to complete - except now you don't have to wait. Both SF_SYNC and SF_KQUEUE should work together, even if it doesn't necessarily make any practical sense. This is designed for use by applications which use backing cache/store files (eg Varnish) or POSIX shared memory (not sure anything is using it yet!) to know when a region of memory is free for re-use. Note it doesn't mark the region as free overall - only free from this transaction. The application developer still needs to track which ranges are in the process of being recycled and wait until all pending transactions are completed. TODO: * documentation, as always Sponsored by: Netflix, Inc.
# a43caef1	08-Jan-2014	Adrian Chadd <adrian@FreeBSD.org>	Refactor out the common sendfile code from the do_sendfile() and the compat32 sendfile syscall. Sponsored by: Netflix, Inc.
# dc3bdd4a	16-Dec-2013	Adrian Chadd <adrian@FreeBSD.org>	Remove the invariants stuff I copy/paste'd from the mbuf code when setting up the UMA zone. This should (a) be correct(er) and (b) it should build on non-amd64. Pointed out by: glebius
# 73242a5e	16-Dec-2013	Adrian Chadd <adrian@FreeBSD.org>	Migrate the sendfile_sync struct to use a UMA zone rather than M_TEMP. This allows it to be better tracked as well as being able to leverage UMA for more interesting/useful behaviour at a later date. Sponsored by: Netflix, Inc.
# ad4804a0	01-Dec-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Remove unused variable.
# 79750e3b	30-Nov-2013	Adrian Chadd <adrian@FreeBSD.org>	Migrate the sendfile_sync structure into a public(ish) API in preparation for extending and reusing it. The sendfile_sync wrapper is mostly just a "mbuf transaction" wrapper, used to indicate that the backing store for a group of mbufs has completed. It's only being used by sendfile for now and it's only implementing a sleep/wakeup rendezvous. However, there are other potential signaling paths (kqueue) and other potential uses (socket zero-copy write) where the same mechanism would also be useful. So, with that in mind: * extract the sendfile_sync code out into sf_sync_() methods teach the sf_sync_alloc method about the current config flag - it will eventually know about kqueue. * move the sendfile_sync code out of do_sendfile() - the only thing it now knows about is the sfs pointer. The guts of the sync rendezvous (setup, rendezvous/wait, free) is now done in the syscall wrapper. * .. and teach the 32-bit compat sendfile call the same. This should be a no-op. It's primarily preparation work for teaching the sendfile_sync about kqueue notification. Tested: * Peter Holm's sendfile stress / regression scripts Sponsored by: Netflix, Inc.
# 3287361e	25-Nov-2013	Adrian Chadd <adrian@FreeBSD.org>	Refactor out the sendfile copyout in order to make vn_sendfile() callable from the kernel. Right now vn_sendfile() can't be called from anything other than a syscall handler _and_ return the number of bytes queued. This simply moves the copyout() to do_sendfile() so that any kernel code can initiate vn_sendfile() outside of a syscall context. Tested: * tiny little sendfile program spitting things out a tcp socket Sponsored by: Netflix, Inc.
# 44f3b9c7	21-Oct-2013	Konstantin Belousov <kib@FreeBSD.org>	Print more useful information about the transfer that trigger the assertion. Other data is available with ddb command 'show pginfo'. Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 255c1caa	22-Sep-2013	Gleb Smirnoff <glebius@FreeBSD.org>	- Create kern.ipc.sendfile namespace, and put the new "readahead" OID there as "kern.ipc.sendfile.readahead". - Push all nsfbuf related tunables into MD code. Don't move them to new namespace in favor of POLA. Reviewed by: scottl Approved by: re (gjb)
# 85fdd534	17-Sep-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Fix assertion in sendfile_readpage() to assert only the validity of requested amount of data in a page. Move assertion down below object unlock. Approved by: re (kib) Sponsored by: Nginx, Inc. Sponsored by: Netflix
# 64c5de54	11-Sep-2013	Konstantin Belousov <kib@FreeBSD.org>	Fix build with gcc. Build-tested by: gjb Approved by: re (glebius)
# 227aaa86	11-Sep-2013	Konstantin Belousov <kib@FreeBSD.org>	Implement sendfile(2) for the posix shared memory segment file descriptor, in addition to the regular files. Requested by: alc Discussed with: emaste Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation Approved by: re (hrs)
# 1a05c762	10-Sep-2013	Dag-Erling Smørgrav <des@FreeBSD.org>	Fix the length calculation for the final block of a sendfile(2) transmission which could be tricked into rounding up to the nearest page size, leaking up to a page of kernel memory. [13:11] In IPv6 and NetATM, stop SIOCSIFADDR, SIOCSIFBRDADDR, SIOCSIFDSTADDR and SIOCSIFNETMASK at the socket layer rather than pass them on to the link layer without validation or credential checks. [SA-13:12] Prevent cross-mount hardlinks between different nullfs mounts of the same underlying filesystem. [SA-13:13] Security: CVE-2013-5666 Security: FreeBSD-SA-13:11.sendfile Security: CVE-2013-5691 Security: FreeBSD-SA-13:12.ifioctl Security: CVE-2013-5710 Security: FreeBSD-SA-13:13.nullfs Approved by: re
# 547561f1	04-Sep-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Style fixes. Most fixes are about not treating integers and pointers as booleans.
# 7008be5b	04-Sep-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. 
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t *cap_rights_init(cap_rights_t *rights, ...); void cap_rights_set(cap_rights_t *rights, ...); void cap_rights_clear(cap_rights_t *rights, ...); bool cap_rights_is_set(const cap_rights_t *rights, ...); bool cap_rights_is_valid(const cap_rights_t *rights); void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src); void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src); bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL); Providing several rights that belong to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. 
This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation
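The bit layout described in 7008be5b can be illustrated with a standalone sketch. It does not use the real Capsicum headers: every DEMO_ name and demo_right_index() are hypothetical, and the shift of 57 is an assumption derived from the message (64 bits, minus the top two count bits, minus the five index bits). The right values are the ones quoted in the commit message.

```c
#include <stdint.h>

/* Assumed position of the lowest of the five one-hot index bits. */
#define DEMO_IDX_SHIFT	57
#define DEMO_CAPRIGHT(idx, bit) \
	((UINT64_C(1) << (DEMO_IDX_SHIFT + (idx))) | (bit))

#define DEMO_CAP_LOOKUP	DEMO_CAPRIGHT(0, UINT64_C(0x0000000000000400))
#define DEMO_CAP_FCHMOD	DEMO_CAPRIGHT(0, UINT64_C(0x0000000000002000))
#define DEMO_CAP_PDKILL	DEMO_CAPRIGHT(1, UINT64_C(0x0000000000000800))

/*
 * Return the array index encoded in a right, or -1 when the index
 * field is empty or has more than one bit set, which is what happens
 * when rights from different array elements are OR'ed together.
 */
int
demo_right_index(uint64_t right)
{
	uint64_t idxbits = (right >> DEMO_IDX_SHIFT) & 0x1f;
	int idx = 0;

	if (idxbits == 0 || (idxbits & (idxbits - 1)) != 0)
		return (-1);
	while ((idxbits & 1) == 0) {
		idxbits >>= 1;
		idx++;
	}
	return (idx);
}
```

With this encoding, an alias such as DEMO_CAP_FCHMOD | DEMO_CAP_LOOKUP still decodes to element 0, while DEMO_CAP_LOOKUP | DEMO_CAP_PDKILL sets two index bits and is rejected, which matches the detection the commit message describes.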
# bb25e5ab	25-Aug-2013	Andre Oppermann <andre@FreeBSD.org>	Give (*ext_free) an int return value allowing for very sophisticated external mbuf buffer management capabilities in the future. For now only EXT_FREE_OK is defined with current legacy behavior. Sponsored by: The FreeBSD Foundation
# 9a736876	24-Aug-2013	Andre Oppermann <andre@FreeBSD.org>	Add an mbuf pointer parameter to (*ext_free) to give the external free function access to the mbuf the external memory was attached to. Mechanically adjust all users to include the mbuf parameter. This fixes a long standing annoyance for external free functions. Before one had to sacrifice one of the argument pointers for this. Sponsored by: The FreeBSD Foundation
# f6d76b0e	23-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Since the 253927, which removed the soft busy call for the sf page, it does not make sense to wait for the soft busy state of the page to drain. The vm object lock is dropped immediately after, so the result of the wait is invalidated. It might make sense to not wait for the hard busy state as well, esp. for the fully valid page, but this is postponed for now. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation
# 5944de8e	22-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9). The flag was mandatory since r209792, where vm_page_grab(9) was changed to only support the alloc retry semantic. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation
# 42d875a5	16-Aug-2013	Xin LI <delphij@FreeBSD.org>	Fix build.
# ca04d21d	15-Aug-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Make sendfile() a method in the struct fileops. Currently only vnode backed file descriptors have this method implemented. Reviewed by: kib Sponsored by: Nginx, Inc. Sponsored by: Netflix
# 90c35c19	13-Aug-2013	Gleb Smirnoff <glebius@FreeBSD.org>	- Minor style(9) fix. - Bring a comment up to date.
# c7aebda8	09-Aug-2013	Attilio Rao <attilio@FreeBSD.org>	The soft and hard busy mechanism rely on the vm object lock to work. Unify the 2 concepts into a real, minimal, sxlock where the shared acquisition represent the soft busy and the exclusive acquisition represent the hard busy. The old VPO_WANTED mechanism becomes the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becomes an interlock for this functionality: it can be held in both read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavily changed; this is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl
# 878a7887	04-Aug-2013	Attilio Rao <attilio@FreeBSD.org>	Remove the unnecessary soft busy of the page before doing vn_rdwr() in kern_sendfile(). The page is already wired so it will not be subjected to pagefault. The content cannot be effectively protected as it is full of races already. Multiple accesses to the same indexes are serialized through vn_rdwr(). Sponsored by: EMC / Isilon storage division Reviewed by: alc, jeff Tested by: pho
# fcd9ff2c	31-Jul-2013	Scott Long <scottl@FreeBSD.org>	Another fix for r253823; retain the default of 1 readahead block for sendfile. Submitted by: glebius Obtained from: Netflix MFC after: 3 days
# fc4a5f05	30-Jul-2013	Scott Long <scottl@FreeBSD.org>	Create a knob, kern.ipc.sfreadahead, that allows one to tune the amount of readahead that sendfile() will do. Default remains the same. Obtained from: Netflix MFC after: 3 days
# 05d1f5bc	15-Jul-2013	Andrey V. Elsukov <ae@FreeBSD.org>	Introduce new structure sfstat for collecting sendfile's statistics and remove corresponding fields from struct mbstat. Use PCPU counters and SFSTAT_INC() macro for update these statistics. Discussed with: glebius
# 3328431d	09-May-2013	Konstantin Belousov <kib@FreeBSD.org>	Item 1 in r248830 causes earlier exits from the sendfile(2), before all requested data was sent. The reason is that xfsize <= 0 condition must not be tested at all if space == loopbytes. Otherwise, the done is set to 1, and sendfile(2) is aborted too early. Instead of moving the condition to exiting the inner loop after the xfersize check, directly check for the completed transfer before the testing of the available space in the socket buffer, and revert item 1 of r248830. It is arguably another bug to sleep waiting for socket buffer space (or return EAGAIN for non-blocking socket) if all bytes are already transferred. Reported by: pho Discussed with: scottl, gibbs Tested by: scottl (stable/9 backport), pho
# da7d2afb	01-May-2013	Jilles Tjoelker <jilles@FreeBSD.org>	Add accept4() system call. The accept4() function, compared to accept(), allows setting the new file descriptor atomically close-on-exec and explicitly controlling the non-blocking status on the new socket. (Note that the latter point means that accept() is not equivalent to any form of accept4().) The linuxulator's accept4 implementation leaves a race window where the new file descriptor is not close-on-exec because it calls sys_accept(). This implementation leaves no such race window (by using falloc() flags). The linuxulator could be fixed and simplified by using the new code. Like accept(), accept4() is async-signal-safe, a cancellation point and permitted in capability mode.
# 3d317679	28-Apr-2013	Konstantin Belousov <kib@FreeBSD.org>	Eliminate the layering violation in the kern_sendfile(). When querying the file size, use VOP_GETATTR() instead of accessing vnode vm_object un_pager.vnp.vnp_size. Take the shared vnode lock earlier to cover the added VOP_GETATTR() call and, as consequence, the whole internal sendfile loop. Reduce vm object lock scope to not protect the local calculations. Note that this is the last misuse of the vnp_size in the tree, the others were removed from the ELF image activator by r230246. Reviewed by: alc Tested by: pho, bf (previous version) MFC after: 1 week
# 14658a80	19-Apr-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Don't compare unsigned socklen_t against < 0. Reviewed by: jhb
# 07dbf2c7	28-Mar-2013	Scott Long <scottl@FreeBSD.org>	Several fixes and improvements to sendfile() 1. If we wanted to send exactly as many bytes as the socket buffer is sized for, the inner loop of kern_sendfile() would see that the socket is full before seeing that it had no more bytes left to send. This would cause it to return EAGAIN to the caller instead of success. Fix by changing the order that these conditions are tested. 2. Simplify the calculation for the bytes to send in each iteration of the inner loop of kern_sendfile() 3. Fix some calls with bogus arguments to sf_buf_ext(). These would only trigger on mbuf allocation failure, but would be hilariously bad if they did trigger. Submitted by: gibbs(3), andre(2) Reviewed by: emax, andre Obtained from: Netflix MFC after: 1 week
# c2e3c52e	19-Mar-2013	Jilles Tjoelker <jilles@FreeBSD.org>	Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC. This change allows creating file descriptors with close-on-exec set in some situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file descriptors (SCM_RIGHTS) atomically close-on-exec. The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD. MSG_CMSG_CLOEXEC is the first free bit for MSG_. The SOCK_ flags are not passed to MAC because this may cause incorrect failures and can be done later via fcntl() anyway. On the other hand, audit is expected to cope with the new flags. For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags argument. Reviewed by: kib
# 93cfe763	15-Mar-2013	Gleb Smirnoff <glebius@FreeBSD.org>	- Use m_get2() instead of hand allocating. - No need for u_int cast here. Sponsored by: Nginx, Inc.
# 3b4a84e7	11-Mar-2013	Gleb Smirnoff <glebius@FreeBSD.org>	In kern_sendfile() use m_extadd() instead of MEXTADD() macro, supplying appropriate wait argument and checking return value. Before this change m_extadd() could fail, and kern_sendfile() ignored that. Sponsored by: Nginx, Inc.
# fbb34710	11-Mar-2013	Michael Tuexen <tuexen@FreeBSD.org>	Return an error if sctp_peeloff() fails because a socket can't be allocated. MFC after: 3 days
# 89f6b863	08-Mar-2013	Attilio Rao <attilio@FreeBSD.org>	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. For this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI is heavily broken by this commit. Third-party ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
# 7493f24e	02-Mar-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	- Implement two new system calls: int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen); int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen); which allow to bind and connect respectively to a UNIX domain socket with a path relative to the directory associated with the given file descriptor 'fd'. - Add manual pages for the new syscalls. - Make the new syscalls available for processes in capability mode sandbox. - Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on the directory descriptor for the syscalls to work. - Update audit(4) to support those two new syscalls and to handle path in sockaddr_un structure relative to the given directory descriptor. - Update procstat(1) to recognize the new capability rights. - Document the new capability rights in cap_rights_limit(2). Sponsored by: The FreeBSD Foundation Discussed with: rwatson, jilles, kib, des
# 2609222a	01-Mar-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Merge Capsicum overhaul: - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrieved with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrieve them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavily modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and even though I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. 
- Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ | PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ | PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE | PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNODE to CAP_MKNODEAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). 
Added convenient defines: #define CAP_PREAD (CAP_SEEK | CAP_READ) #define CAP_PWRITE (CAP_SEEK | CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ) #define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \ CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \ CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \ CAP_SETSOCKOPT | CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib
# fbda3d5d	06-Feb-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Audit sockaddr argument for bind(2), connect(2), accept(2), sendto(2) and recvfrom(2) syscalls. Sponsored by: The FreeBSD Foundation
# 82b316b3	06-Feb-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Minor style tweaks.
# eb1b1807	05-Dec-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Mechanically substitute flags from historic mbuf allocator with malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually
# 5050aa86	22-Oct-2012	Konstantin Belousov <kib@FreeBSD.org>	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho
# 1c771f92	05-Aug-2012	Konstantin Belousov <kib@FreeBSD.org>	After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason to pull vm_param.h was removed. Other big dependency of vm_page.h on vm_param.h are PA_LOCK* definitions, which are only needed for in-kernel code, because modules use KBI-safe functions to lock the pages. Stop including vm_param.h into vm_page.h. Include vm_param.h explicitly for the kernel code which needs it. Suggested and reviewed by: alc MFC after: 2 weeks
# f3cd9805	11-Jun-2012	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Style fixes and simplifications. MFC after: 1 month
# 3b5da8d6	08-Jun-2012	Mateusz Guzik <mjg@FreeBSD.org>	Plug socket refcount leak on error in sys_sctp_peeloff. Reviewed by: tuexen Approved by: trasz (mentor) MFC after: 3 days
# 36eeafa0	04-Jun-2012	Gleb Smirnoff <glebius@FreeBSD.org>	style(9) for r236563.
# 8955d272	04-Jun-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Microoptimisation of code from r236560, also coming from Nginx Inc. Submitted by: ru
# 835d8900	03-Jun-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Optimise kern_sendfile(): skip cycling through the entire mbuf chain in m_cat(), storing pointer to last mbuf in chain in local variable and attaching new mbuf to the end of chain. Submitter reports that CPU load dropped for > 10% on a web server serving large files with this optimisation. Submitted by: Sergey Budnevitch <sb nginx.com>
# 99f293a2	15-Mar-2012	Michael Tuexen <tuexen@FreeBSD.org>	Fix bugs which can result in a panic when a non-SCTP socket is used with an sctp_ system-call which expects an SCTP socket. MFC after: 3 days.
# 526d0bd5	20-Feb-2012	Konstantin Belousov <kib@FreeBSD.org>	Fix found places where uio_resid is truncated to int. Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from the usermode. Discussed with: bde, das (previous versions) MFC after: 1 month
# 8451d0dd	16-Sep-2011	Kip Macy <kmacy@FreeBSD.org>	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)
# a9d2f8d8	10-Aug-2011	Robert Watson <rwatson@FreeBSD.org>	Second-to-last commit implementing Capsicum capabilities in the FreeBSD kernel for FreeBSD 9.0: Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op. Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions. In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit. Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent. Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc
# 12bc222e	30-Jun-2011	Jonathan Anderson <jonathan@FreeBSD.org>	Add some checks to ensure that Capsicum is behaving correctly, and add some more explicit comments about what's going on and what future maintainers need to do when e.g. adding a new operation to a sys_machdep.c. Approved by: mentor(rwatson), re(bz)
# c721b934	07-Jun-2011	John Baldwin <jhb@FreeBSD.org>	Log the socket address passed as the destination to sendto() and sendmsg() via ktrace. MFC after: 1 week
# 1fe80828	01-Apr-2011	Konstantin Belousov <kib@FreeBSD.org>	After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9) and remove the falloc() version that lacks flag argument. This is done to reduce the KPI bloat. Requested by: jhb X-MFC-note: do not
# 1fb51a12	16-Feb-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Mfp4 CH=177274,177280,177284-177285,177297,177324-177325 VNET socket push back: try to minimize the number of places where we have to switch vnets and narrow down the time we stay switched. Add assertions to the socket code to catch possibly unset vnets as seen in r204147. While this reduces the number of vnet recursion in some places like NFS, POSIX local sockets and some netgraph, .. recursions are impossible to fix. The current expectations are documented at the beginning of uipc_socket.c along with the other information there. Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH Reviewed by: jhb Tested by: zec Tested by: Mikolaj Golub (to.my.trociny gmail.com) MFC after: 2 weeks
# 8189ac85	03-Feb-2011	Alan Cox <alc@FreeBSD.org>	Eliminate unnecessary page hold_count checks. These checks predate r90944, which introduced a general mechanism for handling the freeing of held pages. Reviewed by: kib@
# 9ca9fc53	28-Jan-2011	Konstantin Belousov <kib@FreeBSD.org>	If more than one thread allocated sf buffers for sendfile(2), and each of the threads needs more while current pool of the buffers is exhausted, then neither thread can make progress. Switch to nowait allocations after we got first buffer already. Reported by: az Reviewed by: alc (previous version) Tested by: pho MFC after: 1 week
# b452cf63	13-Dec-2010	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Just pass M_ZERO to malloc(9) instead of clearing allocated memory separately.
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# 049640c1	05-Sep-2010	Michael Tuexen <tuexen@FreeBSD.org>	Implement correct handling of address parameter and sendinfo for SCTP send calls. MFC after: 4 weeks.
# 2f9f22ae	05-Jul-2010	Michael Tuexen <tuexen@FreeBSD.org>	MFC r209624 * Do not dereference a NULL pointer when calling an SCTP send syscall not providing a destination address and using ktrace. * Do not copy out kernel memory when providing sinfo for sctp_recvmsg(). Both bugs were reported by Valentin Nechayev. The first bug results in a kernel panic. Approved by: re@
# 7a6f3d78	29-Jun-2010	John Baldwin <jhb@FreeBSD.org>	Send SIGPIPE to the thread that issued the offending system call rather than to the entire process. Reported by: Anit Chakraborty Reviewed by: kib, deischen (concept) MFC after: 1 week
# e1c97831	26-Jun-2010	Michael Tuexen <tuexen@FreeBSD.org>	* Do not dereference a NULL pointer when calling an SCTP send syscall not providing a destination address and using ktrace. * Do not copy out kernel memory when providing sinfo for sctp_recvmsg(). Both bugs were reported by Valentin Nechayev. The first bug results in a kernel panic. MFC after: 3 days.
# 60ae52f7	21-Jun-2010	Ed Schouten <ed@FreeBSD.org>	Use ISO C99 integer types in sys/kern where possible. There are only about 100 occurrences of the BSD-specific u_int*_t datatypes in sys/kern. The ISO C99 integer types are used here more often.
# f0c0d399	06-May-2010	Alan Cox <alc@FreeBSD.org>	Remove page queues locking from all sf_buf_mext()-like functions. The page lock now suffices. Fix a couple nearby style violations.
# 52683078	06-May-2010	Alan Cox <alc@FreeBSD.org>	Eliminate a small bit of unneeded code from kern_sendfile(): While kern_sendfile() is running, the file's vm object can't be destroyed because kern_sendfile() increments the vm object's reference count. (Once kern_sendfile() decrements the reference count and returns, the vm object can, however, be destroyed. So, sf_buf_mext() must handle the case where the vm object is destroyed.) Reviewed by: kib
# 91381493	02-May-2010	Alan Cox <alc@FreeBSD.org>	This is the first step in transitioning responsibility for synchronizing access to the page's wire_count from the page queues lock to the page lock. Submitted by: kmacy
# a0b8e597	02-May-2010	Konstantin Belousov <kib@FreeBSD.org>	Lock the page around hold_count access. Reviewed by: alc
# 9fb7bf55	07-Apr-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r205318: Properly handle compat32 calls to sctp generic sendmsg/recvmsg functions that take iov.
# 003465f5	02-Apr-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r205317: Remove dead statement.
# 131e8de2	02-Apr-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r205316: Fix two style issues.
# 5322f02e	19-Mar-2010	Konstantin Belousov <kib@FreeBSD.org>	Properly handle compat32 calls to sctp generic sendmsg/recvmsg functions that take iov. Reviewed by: tuexen MFC after: 2 weeks
# fd9d1e76	19-Mar-2010	Konstantin Belousov <kib@FreeBSD.org>	Remove dead statement. Reviewed by: tuexen MFC after: 2 weeks
# 0a977ede	19-Mar-2010	Konstantin Belousov <kib@FreeBSD.org>	Fix two style issues. MFC after: 2 weeks
# 0454fe84	18-Feb-2010	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Use NULL instead of 0 when setting up pointer.
# fc946fd0	27-Dec-2009	Matt Jacob <mjacob@FreeBSD.org>	MFC 200620,200621: fix argument order to mtx_init call.
# e7d829a4	16-Dec-2009	Matt Jacob <mjacob@FreeBSD.org>	Fix argument order in a call to mtx_init. MFC after: 1 week
# cf19fced	07-Dec-2009	Michael Tuexen <tuexen@FreeBSD.org>	MFC 197288,197326,197327,197328,197342,197914,197929, 197955,199365,199370,199371,199373,199866 This MFCs all SCTP/VNET relevant fixes from head. Approved by: rrs (mentor)
# c3568f6f	17-Nov-2009	Konstantin Belousov <kib@FreeBSD.org>	MFC r198853: If socket buffer space appears to be lower than the sum of count of already prepared bytes and next portion of transfer, inner loop of kern_sendfile() aborts, not preparing next mbuf for socket buffer, and not modifying any outer loop invariants. The thread loops in the outer loop forever. Instead of breaking from inner loop, prepare only bytes that fit into the socket buffer space.
# 1c89fc75	02-Nov-2009	Konstantin Belousov <kib@FreeBSD.org>	If socket buffer space appears to be lower than the sum of count of already prepared bytes and next portion of transfer, inner loop of kern_sendfile() aborts, not preparing next mbuf for socket buffer, and not modifying any outer loop invariants. The thread loops in the outer loop forever. Instead of breaking from inner loop, prepare only bytes that fit into the socket buffer space. In collaboration with: pho Reviewed by: bz PR: kern/138999 MFC after: 2 weeks
# 7415a41f	29-Oct-2009	Konstantin Belousov <kib@FreeBSD.org>	Fix style issue.
# 88c45ef7	08-Oct-2009	Konstantin Belousov <kib@FreeBSD.org>	MFC r197662: Do not dereference vp->v_mount without holding vnode lock and checking that the vnode is not reclaimed. Approved by: re (bz)
# 75ffdc40	30-Sep-2009	Konstantin Belousov <kib@FreeBSD.org>	Do not dereference vp->v_mount without holding vnode lock and checking that the vnode is not reclaimed. Noted by: Igor Sysoev <is rambler-co ru> MFC after: 1 week
# 8518270e	19-Sep-2009	Michael Tuexen <tuexen@FreeBSD.org>	Get SCTP working in combination with VIMAGE. Contains code from bz. Approved by: rrs (mentor) MFC after: 1 month.
# 530c0060	01-Aug-2009	Robert Watson <rwatson@FreeBSD.org>	Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)
# 15ca46f6	01-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Audit file descriptor numbers for various socket-related system calls. Approved by: re (audit argument blanket) MFC after: 3 days
# 9e4c1521	01-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Define missing audit argument macro AUDIT_ARG_SOCKET(), and capture the domain, type, and protocol arguments to socket(2) and socketpair(2). Approved by: re (audit argument blanket) MFC after: 3 days
# c03528b6	10-Jun-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	SCTP needs either IPv4 or IPv6 as lower layer[1]. So properly hide the already #ifdef SCTP code with #if defined(INET) || defined(INET6) as well to get us closer to a non-INET/INET6 kernel. Discussed with: tuexen [1]
# bcf11e8d	05-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
# f93bfb23	02-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Add internal 'mac_policy_count' counter to the MAC Framework, which is a count of the number of registered policies. Rather than unconditionally locking sockets before passing them into MAC, lock them in the MAC entry points only if mac_policy_count is non-zero. This avoids locking overhead for a number of socket system calls when no policies are registered, eliminating measurable overhead for the MAC Framework for the socket subsystem when there are no active policies. Possibly socket locks should be acquired by policies if they are required for socket labels, which would further avoid locking overhead when there are policies but they don't require labeling of sockets, or possibly don't even implement socket controls. Obtained from: TrustedBSD Project
# 4202e1be	30-May-2009	Dmitry Chagin <dchagin@FreeBSD.org>	Split native socketpair() syscall onto kern_socketpair() which should be used by kernel consumers and socketpair() itself. Approved by: kib (mentor) MFC after: 1 month
# bf422e5f	13-May-2009	Jeff Roberson <jeff@FreeBSD.org>	- Implement a lockless file descriptor lookup algorithm in fget_unlocked(). - Save old file descriptor tables created on expansion until the entire descriptor table is freed so that pointers may be followed without regard for expanders. - Mark the file zone as NOFREE so we may attempt to reference potentially freed files. - Convert several fget_locked() users to fget_unlocked(). This requires us to manage reference counts explicitly but reduces locking overhead in the common case.
# 2114e063	08-May-2009	Marko Zec <zec@FreeBSD.org>	A NOP change: style / whitespace cleanup of the noise that slipped into r191816. Spotted by: bz Approved by: julian (mentor) (an earlier version of the diff)
# 21ca7b57	05-May-2009	Marko Zec <zec@FreeBSD.org>	Change the curvnet variable from a global const struct vnet , previously always pointing to the default vnet context, to a dynamically changing thread-local one. The curvnet context should be set on entry to networking code via CURVNET_SET() macros, and reverted to previous state via CURVNET_RESTORE(). Recursions on curvnet are permitted, though strongly discouraged. This change should have no functional impact on nooptions VIMAGE kernel builds, where CURVNET_ macros expand to whitespace. The curthread->td_vnet (aka curvnet) variable's purpose is to be an indicator of the vnet context in which the current network-related operation takes place, in case we cannot deduce the current vnet context from any other source, such as by looking at mbuf's m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so far curvnet has turned out to be an invaluable consistency checking aid: it helps to catch cases when sockets, ifnets or any other vnet-aware structures may have leaked from one vnet to another. The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros was a result of an empirical iterative process, with an aim to reduce recursions on CURVNET_SET() to a minimum, while still reducing the scope of CURVNET_SET() to networking only operations - the alternative would be calling CURVNET_SET() on each system call entry. In general, curvnet has to be set in three typical cases: when processing socket-related requests from userspace or from within the kernel; when processing inbound traffic flowing from device drivers to upper layers of the networking stack, and when executing timer-driven networking functions. This change also introduces a DDB subcommand to show the list of all vnet instances. Approved by: julian (mentor)
# f0b9868d	11-Apr-2009	Kip Macy <kmacy@FreeBSD.org>	sendfile doesn't modify the vnode - acquire vnode lock shared Reviewed by: ups, jeffr
# 1ede983c	23-Oct-2008	Dag-Erling Smørgrav <des@FreeBSD.org>	Retire the MALLOC and FREE macros. They are an abomination unto style(9). MFC after: 3 months
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 17c2fc0c	22-May-2008	Robert Watson <rwatson@FreeBSD.org>	When sendto(2) is called with an explicit destination address argument, call mac_socket_check_connect() on that address before proceeding with the send. Otherwise policies instrumenting the connect entry point for the purposes of checking destination addresses will not have the opportunity to check implicit connect requests. MFC after: 3 weeks Sponsored by: nCircle Network Security, Inc.
# ae11a989	27-Apr-2008	Robert Watson <rwatson@FreeBSD.org>	When writing trailers in sendfile(2), don't call kern_writev() while holding the socket buffer lock. This leads to an immediate panic due to recursing the socket buffer lock. This bug was introduced in uipc_syscalls.c:1.240, but masked by another bug until that was fixed in uipc_syscalls.c:1.269. Note that the current fix isn't perfect, but better than panicking: normally we guarantee that simultaneous invocations of a system call to write on a stream socket won't be interlaced, which is ensured by use of the socket buffer sleep lock. This is guaranteed for the sendfile headers, but not trailers. In practice, this is likely not a problem, but should be fixed. MFC after: 3 days Pointy hat to: andre (1.240), cperciva (1.269)
# ea26d587	25-Mar-2008	Ruslan Ermilov <ru@FreeBSD.org>	Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT. Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true since the advent of MBUMA. Reviewed by: arch There are ongoing disputes as to whether we want to switch to directly using UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.
# 49186916	23-Feb-2008	Colin Percival <cperciva@FreeBSD.org>	After finishing sending file data in sendfile(2), don't forget to send the provided trailers. This has been broken since revision 1.240. Submitted by: Dan Nelson PR: kern/120948 "sounds ok to me" from: phk MFC after: 3 days
# 60e15db9	22-Feb-2008	Dag-Erling Smørgrav <des@FreeBSD.org>	This patch adds a new ktrace(2) record type, KTR_STRUCT, whose payload consists of the null-terminated name and the contents of any structure you wish to record. A new ktrstruct() function constructs and emits a KTR_STRUCT record. It is accompanied by convenience macros for struct stat and struct sockaddr. In kdump(1), KTR_STRUCT records are handled by a dispatcher function that runs stringent sanity checks on its contents before handing it over to individual decoding functions for each type of structure. Currently supported structures are struct stat and struct sockaddr for the AF_INET, AF_INET6 and AF_UNIX families; support for AF_APPLETALK and AF_IPX is present but disabled, as I am unable to test it properly. Since 's' was already taken, the letter 't' is used by ktrace(1) to enable KTR_STRUCT trace points, and in kdump(1) to enable their decoding. Derived from patches by Andrew Li <andrew2.li@citi.com>. PR: kern/117836 MFC after: 3 weeks
# 1b708999	14-Feb-2008	Simon L. B. Nielsen <simon@FreeBSD.org>	Fix sendfile(2) write-only file permission bypass. Security: FreeBSD-SA-08:03.sendfile Submitted by: kib
# b75a1171	03-Feb-2008	Poul-Henning Kamp <phk@FreeBSD.org>	Give sendfile(2) a SF_SYNC flag which makes it wait until all mbufs referencing the file's VM pages are returned from the network stack, making changes to the file safe. This flag does not guarantee that the data has been transmitted to the other end.
# cf827063	01-Feb-2008	Poul-Henning Kamp <phk@FreeBSD.org>	Give MEXTADD() another argument to make both void pointers to the free function controllable, instead of passing the KVA of the buffer storage as the first argument. Fix all conventional users of the API to pass the KVA of the buffer as the first argument, to make this a no-op commit. Likely break the only non-conventional user of the API, after informing the relevant committer. Update the mbuf(9) manual page, which was already out of sync on this point. Bump __FreeBSD_version to 800016 as there is no way to tell how many arguments a CPP macro needs any other way. This paves the way for giving sendfile(9) a way to wait for the passed storage to have been accessed before returning. This does not affect the memory layout or size of mbufs. Parental oversight by: sam and rwatson. No MFC is anticipated.
# 265de5bb	31-Jan-2008	Robert Watson <rwatson@FreeBSD.org>	Correct two problems relating to sorflush(), which is called to flush read socket buffers in shutdown() and close(): - Call socantrcvmore() before sblock() to dislodge any threads that might be sleeping (potentially indefinitely) while holding sblock(), such as a thread blocked in recv(). - Flag the sblock() call as non-interruptible so that a signal delivered to the thread calling sorflush() doesn't cause sblock() to fail. The sblock() is required to ensure that all other socket consumer threads have, in fact, left, and do not enter, the socket buffer until we're done flushing it. To implement the latter, change the 'flags' argument to sblock() to accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK flag. When SBL_NOINTR is set, it forces a non-interruptible sx acquisition, regardless of the setting of the disposition of SB_NOINTR on the socket buffer; without this change it would be possible for another thread to clear SB_NOINTR between when the socket buffer mutex is released and sblock() is invoked. Reviewed by: bz, kmacy Reported by: Jos Backus <jos at catnook dot com>
# 22db15c0	13-Jan-2008	Attilio Rao <attilio@FreeBSD.org>	VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjunction with 'thread' argument passing which is always curthread. Remove the unnecessary extra argument and pass curthread explicitly to lower layer functions, when necessary. KPI results broken by this change, which should affect several ports, so version bumping and manpage update will be further committed. Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
# cb05b60a	09-Jan-2008	Attilio Rao <attilio@FreeBSD.org>	vn_lock() is currently only used with the 'curthread' passed as argument. Remove this argument and pass curthread directly to underlying VOP_LOCK1() VFS method. This change makes the code cleaner and, in particular, removes an annoying dependency, helping the next lockmgr() cleanup. KPI results, obviously, changed. Manpage and FreeBSD_version will be updated through further commits. As a side note, it is worth saying that the next commits will address a similar cleanup of VFS methods, in particular vop_lock1 and vop_unlock. Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>
# 397c19d1	29-Dec-2007	Jeff Roberson <jeff@FreeBSD.org>	Remove explicit locking of struct file. - Introduce a finit() which is used to initialize the fields of struct file in such a way that the ops vector is only valid after the data, type, and flags are valid. - Protect f_flag and f_count with atomic operations. - Remove the global list of all files and associated accounting. - Rewrite the unp garbage collection such that it no longer requires the global list of all files and instead uses a list of all unp sockets. - Mark sockets in the accept queue so we don't incorrectly gc them. Tested by: kris, pho
# 30d239bc	24-Oct-2007	Robert Watson <rwatson@FreeBSD.org>	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updated as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
# 2afb3e84	26-Aug-2007	Randall Stewart <rrs@FreeBSD.org>	- During shutdown pending, when the last sack came in and the last message on the send stream was "null" but still there, a state we allow, we could get hung and not clean it up and wait for the shutdown guard timer to clear the association without a graceful close. Fix this so that that we properly clean up. - Added support for Multiple ASCONF per new RFC. We only (so far) accept input of these and cannot yet generate a multi-asconf. - Sysctl'd support for experimental Fast Handover feature. Always disabled unless sysctl or socket option changes to enable. - Error case in add-ip where the peer supports AUTH and ADD-IP but does NOT require AUTH of ASCONF/ASCONF-ACK. We need to ABORT in this case. - According to the Kyoto summit of socket api developers (Solaris, Linux, BSD). We need to have: o non-eeor mode messages be atomic - Fixed o Allow implicit setup of an assoc in 1-2-1 model if using the sctp_**() send calls - Fixed o Get rid of HAVE_XXX declarations - Done o add a sctp_pr_policy in hole in sndrcvinfo structure - Done o add a PR_SCTP_POLICY_VALID type flag - yet to-do in a future patch! - Optimize sctp6 calls to reuse code in sctp_usrreq. Also optimize when we close sending out the data and disabling Nagle. - Change key concatenation order to match the auth RFC - When sending OOTB shutdown_complete always do csum. - Don't send PKT-DROP to a PKT-DROP - For abort chunks just always checksums same for shutdown-complete. - inpcb_free front state had a bug where in queue data could wedge an assoc. We need to just abandon ones in front states (free_assoc). - If a peer sends us a 64k abort, we would try to assemble a response packet which may be larger than 64k. This then would be dropped by IP. Instead make a "minimum" size for us 64k-2k (we want at least 2k for our initack). If we receive such an init discard it early without all the processing. 
- When we peel off we must increment the tcb ref count to keep it from being freed from underneath us. - handling fwd-tsn had bugs that caused memory overwrites when given faulty data, fixed so can't happen and we also stop at the first bad stream no. - Fixed so comm-up generates the adaption indication. - peeloff did not get the hmac params copied. - fix it so we lock the addr list when doing src-addr selection (in future we need to use a multi-reader/one writer lock here) - During lowlevel output, we could end up with a _l_addr set to null if the iterator is calling the output routine. This means we would possibly crash when we gather the MTU info. Fix so we only do the gather where we have a src address cached. - we need to be sure to set abort flag on conn state when we receive an abort. - peeloff could leak a socket. Moved code so the close will find the socket if the peeloff fails (uipc_syscalls.c) Approved by: re@freebsd.org(Ken Smith)
# 0bf686c1	06-Aug-2007	Robert Watson <rwatson@FreeBSD.org>	Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which previously conditionally acquired Giant based on debug.mpsafenet. As that has now been removed, they are no longer required. Removing them significantly simplifies error-handling in the socket layer, eliminated quite a bit of unwinding of locking in error cases. While here clean up the now unneeded opt_net.h, which previously was used for the NET_WITH_GIANT kernel option. Clean up some related gotos for consistency. Reviewed by: bz, csjp Tested by: kris Approved by: re (kensmith)
# b8709d23	01-Jul-2007	Randall Stewart <rrs@FreeBSD.org>	- Add some needed error checking on bad fd passing in the sctp syscalls. Approved by: re@freebsd.org (Ken Smith) Obtained from: Weongyo Jeong (weongyo.jeong@gmail.com)
# 1c9dbd15	19-May-2007	Andre Oppermann <andre@FreeBSD.org>	In kern_sendfile() adjust byte accounting of the file sending loop to ignore the size of any headers that were passed with the sendfile(2) system call. Otherwise the file sent will be truncated by the header size if the nbytes parameter was provided. The bug doesn't show up when either nbytes is zero, meaning send the whole file, or no header iovec is provided. Resolve a potential error aliasing of errors from the VM and sf_buf parts and the protocol send parts where an error of the latter overwrites one of the former. Update comments. The byte accounting bug wasn't seen earlier because none of the popular sendfile(2) consumers, Apache, lighttpd and our ftpd(8) use it in modes that trigger it. The varnish HTTP proxy makes full use of it and exposed the problem. Bug found by: phk Tested by: phk
# d19e16a7	16-May-2007	Robert Watson <rwatson@FreeBSD.org>	Generally migrate to ANSI function headers, and remove 'register' use.
# 7abab911	03-May-2007	Robert Watson <rwatson@FreeBSD.org>	sblock() implements a sleep lock by interlocking SB_WANT and SB_LOCK flags on each socket buffer with the socket buffer's mutex. This sleep lock is used to serialize I/O on sockets in order to prevent I/O interlacing. This change replaces the custom sleep lock with an sx(9) lock, which results in marginally better performance, better handling of contention during simultaneous socket I/O across multiple threads, and a cleaner separation between the different layers of locking in socket buffers. Specifically, the socket buffer mutex is now solely responsible for serializing simultaneous operation on the socket buffer data structure, and not for I/O serialization. While here, fix two historic bugs: (1) a bug allowing I/O to be occasionally interlaced during long I/O operations (discovered by Isilon). (2) a bug in which failed non-blocking acquisition of the socket buffer I/O serialization lock might be ignored (discovered by sam). SCTP portion of this patch submitted by rrs.
# eed20b37	20-Apr-2007	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Don't reinvent vm_page_grab(). Reviewed by: ups
# fb1daf81	18-Apr-2007	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Fix a bug in sendfile(2) when files are larger than page size and nbytes=0. When nbytes=0, sendfile(2) should use file size. Because of the bug, it was sending half of a file. The bug is that 'off' variable can't be used for size calculation, because it changes inside the loop, so we should use uap->offset instead.
# 7b20aa9c	06-Apr-2007	Robert Watson <rwatson@FreeBSD.org>	Remove XXX comment that changes to file fields should be protected with the file lock rather than the filedesc lock: I fixed this in the last revision. Spotted by: kris
# 5e3f7694	04-Apr-2007	Robert Watson <rwatson@FreeBSD.org>	Replace custom file descriptor array sleep lock constructed using a mutex and flags with an sxlock. This leads to a significant and measurable performance improvement as a result of access to shared locking for frequent lookup operations, reduced general overhead, and reduced overhead in the event of contention. All of these are important for threaded applications where simultaneous access to a shared file descriptor array occurs frequently. Kris has reported 2x-4x transaction rate improvements on 8-core MySQL benchmarks; smaller improvements can be expected for many workloads as a result of reduced overhead. - Generally eliminate the distinction between "fast" and regular acquisition of the filedesc lock; the plan is that they will now all be fast. Change all locking instances to either shared or exclusive locks. - Correct a bug (pointed out by kib) in fdfree() where previously msleep() was called without the mutex held; sx_sleep() is now always called with the sxlock held exclusively. - Universally hold the struct file lock over changes to struct file, rather than the filedesc lock or no lock. Always update the f_ops field last. A further memory barrier is required here in the future (discussed with jhb). - Improve locking and reference management in linux_at(), which fails to properly acquire vnode references before using vnode pointers. Annotate improper use of vn_fullpath(), which will be replaced at a future date. In fcntl(), we conservatively acquire an exclusive lock, even though in some cases a shared lock may be sufficient, which should be revisited. The dropping of the filedesc lock in fdgrowtable() is no longer required as the sxlock can be held over the sleep operation; we should consider removing that (pointed out by attilio). Tested by: kris Discussed with: jhb, kris, attilio, jeff
# 1ce2bc91	02-Apr-2007	John Baldwin <jhb@FreeBSD.org>	Fix a fd leak in socketpair(): - Close the new file objects created during socketpair() if the copyout of the new file descriptors fails. - Add a test to the socketpair regression test for this edge case.
# 873fbcd7	05-Mar-2007	Robert Watson <rwatson@FreeBSD.org>	Further system call comment cleanup: - Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde) - Remove extra blank lines in some cases. - Add extra blank lines in some cases. - Remove no-op comments consisting solely of the function name, the word "syscall", or the system call name. - Add punctuation. - Re-wrap some comments.
# 0c14ff0e	04-Mar-2007	Robert Watson <rwatson@FreeBSD.org>	Remove 'MPSAFE' annotations from the comments above most system calls: all system calls now enter without Giant held, and then in some cases, acquire Giant explicitly. Remove a number of other MPSAFE annotations in the credential code and tweak one or two other adjacent comments.
# 6dbde030	23-Jan-2007	Randall Stewart <rrs@FreeBSD.org>	Fixes MSG_PEEK for sctp_generic_recvmsg(): the msg_flags were not being copied in properly, so PEEK and any other msg_flags input operation were not being performed right. Approved by: gnn
# 3e932ca7	12-Nov-2006	Andre Oppermann <andre@FreeBSD.org>	In kern_sendfile() fix the calculation of sbytes (the total number of bytes written to the socket). The rewrite in revision 1.240 got confused by the FreeBSD 4.x bug compatibility code. For some reason lighttpd, which was used for testing the new sendfile code, was not affected by the problem, but apache and others using headers/trailers in the sendfile call received incorrect sbytes values after return from non-blocking sockets. This then led to restarts with wrong offsets and thus mixed up file contents when the socket was writeable again. All programs not using headers/trailers, like ftpd, were not affected by the bug. Reported by: Pawel Worach <pawel.worach-at-gmail.com> Tested by: Pawel Worach <pawel.worach-at-gmail.com>
# 62b36a7f	07-Nov-2006	Andre Oppermann <andre@FreeBSD.org>	Style cleanups to the sctp_* syscall functions.
# bda8b1f3	06-Nov-2006	Andre Oppermann <andre@FreeBSD.org>	Handle early errors in kern_sendfile() by introducing a new goto 'out' label after the sbunlock() part. This correctly handles calls to sendfile(2) without valid parameters that was broken in rev. 1.240. Coverity error: 272162
# f8829a4a	03-Nov-2006	Randall Stewart <rrs@FreeBSD.org>	Ok, here it is, we finally add SCTP to current. Note that this work is not just mine, but it is also the work of Peter Lei and Michael Tuexen. They are my two key other developers working on the project, and they need attaboys too: ** peterlei@cisco.com tuexen@fh-muenster.de ** I did do a make sysent which updated the syscalls and sysproto. I hope that is correct; without it you don't build since we have new syscalls for SCTP :-0 So go out and look at the NOTES, add option SCTP (make sure inet and inet6 are present too) and play with SCTP. I will see about committing some test tools I have after I figure out where I should place them. I also have a lib (libsctp.a) that adds some of the missing socketapi functions that I need to put into libs. I will talk to George about this :-) There may still be some 64 bit issues in here, none of us have a 64 bit processor to test with yet. Michael may have a Mac but that's another beast too. If you have a Mac and want to use SCTP contact Michael, he maintains a web site with a loadable module with this code :-) Reviewed by: gnn Approved by: gnn
# 5e20f43d	02-Nov-2006	Andre Oppermann <andre@FreeBSD.org>	Rename m_getm() to m_getm2() and rewrite it to allocate up to page sized mbuf clusters. Add a flags parameter to accept M_PKTHDR and M_EOR mbuf chain flags. Provide compatibility macro for m_getm() calling m_getm2() with M_PKTHDR set. Rewrite m_uiotombuf() to use m_getm2() for mbuf allocation and do the uiomove() in a tight loop over the mbuf chain. Add a flags parameter to accept mbuf flags to be passed to m_getm2(). Adjust all callers for the extra parameter. Sponsored by: TCP/IP Optimization Fundraise 2005 MFC after: 3 month
# d99b0dd2	02-Nov-2006	Andre Oppermann <andre@FreeBSD.org>	Rewrite kern_sendfile() to work in two loops, the inner which turns as many VM pages into mbufs as it can -- up to the free send socket buffer space. The outer loop then drops the whole mbuf chain into the send socket buffer, calls tcp_output() on it and then waits until 50% of the socket buffer is free again to repeat the cycle. This way tcp_output() gets the full amount of data to work with and can issue up to 64K sends for TSO to chop up in the network adapter without using any CPU cycles. Thus it gets very efficient especially with the readahead the VM and I/O system do. The previous sendfile(2) code simply looped over the file, turned each 4K page into an mbuf and sent it off. This had the effect that TSO could only generate 2 packets per send instead of up to 44 at its maximum of 64K. Add experimental SF_MNOWAIT flag to sendfile(2) to return ENOMEM instead of sleeping on mbuf allocation failures. Benchmarking shows significant improvements (95% confidence): 45% less cpu (or 1.81 times better) with new sendfile vs. old sendfile (non-TSO) 83% less cpu (or 5.7 times better) with new sendfile vs. old sendfile (TSO) (Sender AMD Opteron 852 (2.6GHz) with em(4) PCI-X-133 interface and receiver DELL Poweredge SC1425 P-IV Xeon 3.2GHz with em(4) LOM connected back to back at 1000Base-TX full duplex.) Sponsored by: TCP/IP Optimization Fundraise 2005 MFC after: 3 month
# aed55708	22-Oct-2006	Robert Watson <rwatson@FreeBSD.org>	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
# 9af80719	21-Oct-2006	Alan Cox <alc@FreeBSD.org>	Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object lock instead of the global page queues lock.
# 5786be7c	09-Aug-2006	Alan Cox <alc@FreeBSD.org>	Introduce a field to struct vm_page for storing flags that are synchronized by the lock on the object containing the page. Transition PG_WANTED and PG_SWAPINPROG to use the new field, eliminating the need for holding the page queues lock when setting or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to VPO_WANTED and VPO_SWAPINPROG, respectively. Eliminate the assertion that the page queues lock is held in vm_page_io_finish(). Eliminate the acquisition and release of the page queues lock around calls to vm_page_io_finish() in kern_sendfile() and vfs_unbusy_pages().
# 7c4b7ecc	05-Aug-2006	Alan Cox <alc@FreeBSD.org>	Reduce the scope of the page queues lock in kern_sendfile() now that vm_page_sleep_if_busy() no longer requires the caller to hold the page queues lock.
# 10c09f3f	03-Aug-2006	Alan Cox <alc@FreeBSD.org>	The page queues lock is no longer required by vm_page_io_start(). Reduce the scope of the page queues lock in kern_sendfile() accordingly.
# f30e89ce	27-Jul-2006	John Baldwin <jhb@FreeBSD.org>	Fix a file descriptor race I reintroduced when I split accept1() up into kern_accept() and accept1(). If another thread closed the new file descriptor and the first thread later got an error trying to copyout the socket address, then it would attempt to close the wrong file object. To fix, add a struct file ** argument to kern_accept(). If it is non-NULL, then on success kern_accept() will store a pointer to the new file object there and not release any of the references. It is up to the calling code to drop the references appropriately (including a call to fdclose() in case of error to safely handle the aforementioned race). While I'm at it, go ahead and fix the svr4 streams code to not leak the accept fd if it gets an error trying to copyout the streams structures.
# b0668f71	24-Jul-2006	Robert Watson <rwatson@FreeBSD.org>	soreceive_generic(), and sopoll_generic(). Add new functions sosend(), soreceive(), and sopoll(), which are wrappers for pru_sosend, pru_soreceive, and pru_sopoll, and are now used universally by socket consumers rather than either directly invoking the old so*() functions or directly invoking the protocol switch method (about an even split prior to this commit). This completes an architectural change that was begun in 1996 to permit protocols to provide substitute implementations, as now used by UDP. Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to perform these operations on sockets -- in particular, distributed file systems and socket system calls. Architectural head nod: sam, gnn, wollman
# b33887ea	19-Jul-2006	John Baldwin <jhb@FreeBSD.org>	Don't free the sockaddr in kern_bind() and kern_connect() as not all callers pass a sockaddr allocated via malloc() from M_SONAME anymore. Instead, free it in the callers when necessary.
# c870740e	10-Jul-2006	John Baldwin <jhb@FreeBSD.org>	- Split out kern_accept(), kern_getpeername(), and kern_getsockname() for use by ABI emulators. - Alter the interface of kern_recvit() somewhat. Specifically, go ahead and hard code UIO_USERSPACE in the uio as that's what all the callers specify. In place, add a new uioseg to indicate what type of pointer is in mp->msg_name. Previously it was always a userland address, but ABI emulators may pass in kernel-side sockaddrs. Also, remove the namelenp field and instead require the two places that used it to explicitly copy mp->msg_namelen out to userland. - Use the patched kern_recvit() to replace svr4_recvit() and the stock kern_sendit() to replace svr4_sendit(). - Use kern_bind() instead of stackgap use in ti_bind(). - Use kern_getpeername() and kern_getsockname() instead of stackgap in svr4_stream_ti_ioctl(). - Use kern_connect() instead of stackgap in svr4_do_putmsg(). - Use kern_getpeername() and kern_accept() instead of stackgap in svr4_do_getmsg(). - Retire the stackgap from SVR4 compat as it is no longer used.
# fb11be62	19-Jun-2006	George V. Neville-Neil <gnn@FreeBSD.org>	Properly cast the values of valsize (the size of the value passed in) in setsockopt so that they can be compared correctly against negative values. Passing in a negative value had a rather negative effect on our socket code, making it impossible to open new sockets. PR: 98858 Submitted by: James.Juran@baesystems.com MFC after: 1 week
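
The signedness pitfall behind the valsize fix above can be sketched in userland C. This is an illustration only, not the kernel code from that commit: `MAX_OPT_SIZE`, `size_ok_signed()` and `size_ok_checked()` are hypothetical names invented for the example.

```c
#include <stddef.h>

#define MAX_OPT_SIZE 256	/* hypothetical upper bound, for illustration */

/*
 * Naive pattern: only an upper bound is tested, so a negative length
 * slips through the signed comparison.
 */
static int
size_ok_signed(int valsize)
{
	return (valsize <= MAX_OPT_SIZE);
}

/*
 * Safer pattern: reject negative lengths explicitly, then compare the
 * now non-negative value against the bound as an unsigned quantity.
 */
static int
size_ok_checked(int valsize)
{
	if (valsize < 0)
		return (0);
	return ((size_t)valsize <= (size_t)MAX_OPT_SIZE);
}
```

With these helpers, `size_ok_signed(-1)` accepts the bogus length while `size_ok_checked(-1)` rejects it; both accept an ordinary value such as 16.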
# b37ffd31	10-Jun-2006	Robert Watson <rwatson@FreeBSD.org>	Move some functions and definitions from uipc_socket2.c to uipc_socket.c: - Move sonewconn(), which creates new sockets for incoming connections on listen sockets, so that all socket allocation code is together in uipc_socket.c. - Move 'maxsockets' and associated sysctls to uipc_socket.c with the socket allocation code. - Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it to sysctl.h and remove lots of scattered implementations in various IPC modules. - Sort sodealloc() after soalloc() in uipc_socket.c for dependency order reasons. Staticize soalloc() and sodealloc() as they are now required only in uipc_socket.c, and are internal to the socket implementation. After this change, socket allocation and deallocation is entirely centralized in one file, and uipc_socket2.c consists entirely of socket buffer manipulation and default protocol switch functions. MFC after: 1 month
# 20bdac8a	25-May-2006	Robert Watson <rwatson@FreeBSD.org>	Use getsock() and fput() instead of fgetsock() and fputsock() in sendfile(). This causes sendfile() to use the file descriptor reference to the socket instead of bumping the socket reference count, which avoids an additional refcount operation, as well as a potential expensive socket refcount drop, which can lead to contention on the accept mutex. This change also has the side effect of further reducing the number of cases where an in-progress I/O operation can occur on a socket after close, as using the file descriptor refcount prevents the socket from closing while in use. MFC after: 3 months
# 102ea033	25-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Extend getsock() to return the struct file flags read while holding the file lock, in the style of fgetsock(). Modify accept1() to use getsock() instead of fgetsock(), relying on the file descriptor reference rather than an acquired socket reference to prevent the listen socket from being destroyed during accept(). This avoids additional reference count operations, which should improve performance, and also avoids accept1() operating on a socket whose file descriptor has been torn down, which may have resulted in protocol shutdown starting. MFC after: 3 months
# fa4c5373	01-Apr-2006	Robert Watson <rwatson@FreeBSD.org>	Add comment to accept1() that it should use getsock() instead of fgetsock() to avoid additional mutex operations, and also to avoid use of soref/sorele which are now not preferred. MFC after: 3 months
# 7c8dcf2d	26-Mar-2006	Alan Cox <alc@FreeBSD.org>	Use NET_LOCK_GIANT() and VFS_LOCK_GIANT() instead of unconditionally acquiring Giant in kern_sendfile(). Guard against the forced reclamation of a vnode in kern_sendfile(). Discussed with: jeff Reviewed by: tegge MFC after: 3 weeks
# fa545f43	28-Feb-2006	Paul Saab <ps@FreeBSD.org>	Fix 32bit sendfile by implementing kern_sendfile so that it takes the header and trailers as iovec arguments instead of copying them in inside of sendfile. Reviewed by: jhb MFC after: 3 weeks
# ecc44de7	31-Oct-2005	Paul Saab <ps@FreeBSD.org>	Reformat socket control messages on input/output for 32bit compatibility on 64bit systems. Submitted by: ps, ups Reviewed by: jhb
# a372f822	14-Oct-2005	Paul Saab <ps@FreeBSD.org>	Implement the 32bit versions of recvmsg, recvfrom, sendmsg Partially obtained from: jhb
# 6758f88e	05-Jul-2005	Robert Watson <rwatson@FreeBSD.org>	Add MAC Framework and MAC policy entry point mac_check_socket_create(), which is invoked from socket() and socketpair(), permitting MAC policy modules to control the creation of sockets by domain, type, and protocol. Obtained from: TrustedBSD Project Sponsored by: SPARTA, SPAWAR Approved by: re (scottl) Requested by: SCC
# 75ae2570	04-May-2005	Maksim Yevmenkin <emax@FreeBSD.org>	Change m_uiotombuf so it will accept offset at which data should be copied to the mbuf. Offset cannot exceed MHLEN bytes. This is currently used to fix Ethernet header alignment problem on alpha and sparc64. Also change all users of m_uiotombuf to pass proper offset. Reviewed by: jmg, sam Tested by: Sten Spans "sten AT blinkenlights DOT nl" MFC after: 1 week
# 7f53207b	16-Apr-2005	Robert Watson <rwatson@FreeBSD.org>	Introduce three additional MAC Framework and MAC Policy entry points to control socket poll() (select()), fstat(), and accept() operations, required for some policies: poll() mac_check_socket_poll() fstat() mac_check_socket_stat() accept() mac_check_socket_accept() Update mac_stub and mac_test policies to be aware of these entry points. While here, add missing entry point implementations for: mac_stub.c stub_check_socket_receive() mac_stub.c stub_check_socket_send() mac_test.c mac_test_check_socket_send() mac_test.c mac_test_check_socket_visible() Obtained from: TrustedBSD Project Sponsored by: SPAWAR, SPARTA
# f247a524	30-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- LK_NOPAUSE is a nop now. Sponsored by: Isilon Systems, Inc.
# 8d6e40c3	08-Mar-2005	Maxim Sobolev <sobomax@FreeBSD.org>	Add kernel-only flag MSG_NOSIGNAL to be used in emulation layers to suppress the SIGPIPE signal for the duration of the sendto-family syscalls. Use it to replace the previously added hack in the Linux layer based on temporarily setting the SO_NOSIGPIPE flag. Suggested by: alfred
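
The effect of MSG_NOSIGNAL can be sketched from userland. The commit above describes the flag as kernel-only at the time; this example assumes a system whose send(2) accepts MSG_NOSIGNAL from user programs (Linux and later FreeBSD releases do). `send_to_closed_peer()` is a hypothetical helper written for this illustration.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Send one byte on a stream socket whose peer has already closed,
 * passing MSG_NOSIGNAL so that the failure is reported through errno
 * instead of a SIGPIPE delivered to the process.
 *
 * Returns the errno value set by send(), 0 if send() unexpectedly
 * succeeded, or -1 if the socket pair could not be created.
 */
static int
send_to_closed_peer(void)
{
	int sv[2];
	ssize_t n;
	int err;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return (-1);
	close(sv[1]);			/* the peer goes away */
	n = send(sv[0], "x", 1, MSG_NOSIGNAL);
	err = (n == -1) ? errno : 0;
	close(sv[0]);
	return (err);
}
```

Without the flag, and with SIGPIPE at its default disposition, the same send() would terminate the process; with it, the caller sees an ordinary EPIPE error.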
# 29bdd019	18-Feb-2005	Robert Watson <rwatson@FreeBSD.org>	Remove now unused 'int s' from spl(). MFC after: 3 days
# d8d716be	18-Feb-2005	Robert Watson <rwatson@FreeBSD.org>	De-spl kern_connect(). MFC after: 3 days
# 1e8f8954	17-Feb-2005	Robert Watson <rwatson@FreeBSD.org>	In accept1(), extend coverage of the socket lock from just covering soref() to also covering the update of so_state. While no other user threads can update the socket state here as it's not yet hooked up to the file descriptor array yet, the protocol could also frob the socket state here, leading to a lost update to the so_state field. No reported instances of this bug (as yet). MFC after: 3 days
# a6886ef1	30-Jan-2005	Maxim Sobolev <sobomax@FreeBSD.org>	Extend kern_sendit() to take another enum uio_seg argument, which specifies where the buffer to send lies and use it to eliminate yet another stackgap in linuxlator. MFC after: 2 weeks
# 8516dd18	24-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Don't use VOP_GETVOBJECT, use vp->v_object directly.
# f6dc414a	24-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Save a line by unlocking before we test.
# 9454b2d8	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for copyright notices, minor format tweaks as necessary
# 124e4c3b	13-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Introduce an alias for FILEDESC_{UN}LOCK() with the suffix _FAST. Use this in all the places where sleeping with the lock held is not an issue. The distinction will become significant once we finalize the exact lock-type to use for this kind of case.
# d3cb0d99	07-Nov-2004	Alan Cox <alc@FreeBSD.org>	Introduce two new options, "CPU private" and "no wait", to sf_buf_alloc(). Change the spelling of the "catch" option to be consistent with the new options. Implement the "no wait" option. An implementation of the "CPU private" for i386 will be committed at a later date.
# ef11fbd7	07-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Introduce fdclose() which will clean an entry in a filedesc. Replace homerolled versions with call to fdclose(). Make fdunused() static to kern_descrip.c
# b1fa7527	07-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Use fget_locked() instead of homerolled
# d19ef814	03-Nov-2004	Alan Cox <alc@FreeBSD.org>	The synchronization provided by vm object locking has eliminated the need for most calls to vm_page_busy(). Specifically, most calls to vm_page_busy() occur immediately prior to a call to vm_page_remove(). In such cases, the containing vm object is locked across both calls. Consequently, the setting of the vm page's PG_BUSY flag is not even visible to other threads that are following the synchronization protocol. This change (1) eliminates the calls to vm_page_busy() that immediately precede a call to vm_page_remove() or functions, such as vm_page_free() and vm_page_rename(), that call it and (2) relaxes the requirement in vm_page_remove() that the vm page's PG_BUSY flag is set. Now, the vm page's PG_BUSY flag is set only when the vm object lock is released while the vm page is still in transition. Typically, this is when it is undergoing I/O.
# ae1e5c5d	24-Oct-2004	Robert Watson <rwatson@FreeBSD.org>	Move from using the socket reference count to the file reference count to prevent sockets from being garbage collected during socket-specific system calls. This is the same approach used in most VFS-specific system calls, as well as generic file descriptor system calls such as read() and write(). To do this, add a utility function getsock(), which is logically identical to getvnode() used for the same purpose in VFS. Unlike fgetsock(), it returns with the file reference count elevated, but no bump of the socket reference count. Replace matching calls to fputsock() with fdrop(). This change is made to all socket system calls other than sendfile() and accept(), but the approach should be applicable to those system calls also. This shaves about four mutex operations off of each of these system calls, including send() and recv() variants, adding about 1% to pps on minimal UDP packets for UP using netblast, and 4% on SMP. Reviewed by: pjd
# 01ad40da	24-Oct-2004	Alan Cox <alc@FreeBSD.org>	Use VM_ALLOC_NOBUSY instead of calling vm_page_wakeup().
# 0f777d7d	20-Oct-2004	Alan Cox <alc@FreeBSD.org>	Modify the vm object locking in do_sendfile() so that the containing object is locked when vm_page_io_finish() is called on a page. This is to satisfy a new, post-RELENG_5 assertion in vm_page_io_finish(). (I am in the process of transitioning the responsibility for synchronizing access to various fields/flags on the page from the global page queues lock to the per-object lock.) Tripped over by: obrien@
# 86dac448	01-Oct-2004	Alan Cox <alc@FreeBSD.org>	Add a SOCKBUF_LOCK() to a rarely executed path in do_sendfile().
# ad3b9257	15-Aug-2004	John-Mark Gurney <jmg@FreeBSD.org>	Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowledge of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using its filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to acquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)
# e140eb43	17-Jul-2004	David Malone <dwmalone@FreeBSD.org>	Add a kern_setsockopt and kern_getsockopt which can read the option values from either user land or from the kernel. Use them for [gs]etsockopt and to clean up some calls to [gs]etsockopt in the Linux emulation code that uses the stackgap.
# 552afd9c	10-Jul-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Clean up and wash struct iovec and struct uio handling. Add copyiniov() which copies a struct iovec array in from userland into a malloc'ed struct iovec. Caller frees. Change uiofromiov() to malloc the uio (caller frees) and name it copyinuio() which is more appropriate. Add cloneuio() which returns a malloc'ed copy. Caller frees. Use them throughout.
# 6ec70e64	08-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	Remove spl()'s from do_sendfile().
# ad6b0eff	23-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Acquire socket lock in the "waiting for connection" loop in kern_connect(), replacing tsleep() with msleep() with the socket mutex.
# a3146ff9	22-Jun-2004	Bruce M Simpson <bms@FreeBSD.org>	Fix an inconsistency in socket option propagation on accept(). Propagate the SS_NBIO flag from the parent socket to the child socket during an accept() operation. The file descriptor O_NONBLOCK flag would have been propagated already by the fflag assignment, and therefore would have been inconsistent with the underlying socket's so_state member. This makes accept() more closely adhere to the API contract we effectively outline in the manual page. Note also that Linux continues to differ here; O_NONBLOCK is not propagated. The other BSDs do propagate the flag, as does Solaris. The Single UNIX Specification does not offer specific advice on this issue. PR: kern/45733 Requested by: Jayanth Vijayaraghavan Reviewed by: rwatson
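
Whether a given system propagates O_NONBLOCK across accept(), as discussed in the commit above, can be probed with a short userland test. This is a sketch under stated assumptions: it needs a working IPv4 loopback interface, and `accepted_inherits_nonblock()` is a hypothetical helper name. It makes no claim about what any particular system returns.

```c
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns 1 if a socket accepted from a non-blocking listener has
 * O_NONBLOCK set, 0 if it does not, and -1 on any setup error.
 */
static int
accepted_inherits_nonblock(void)
{
	struct sockaddr_in sin;
	struct pollfd pfd;
	socklen_t len = sizeof(sin);
	int ls, cs, as = -1, flags, ret = -1;

	ls = socket(AF_INET, SOCK_STREAM, 0);
	cs = socket(AF_INET, SOCK_STREAM, 0);
	if (ls == -1 || cs == -1)
		goto out;

	/* Bind to an ephemeral loopback port and learn which one we got. */
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (bind(ls, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(ls, 1) == -1 ||
	    getsockname(ls, (struct sockaddr *)&sin, &len) == -1)
		goto out;

	/* Mark the listening socket non-blocking. */
	flags = fcntl(ls, F_GETFL);
	if (flags == -1 || fcntl(ls, F_SETFL, flags | O_NONBLOCK) == -1)
		goto out;

	if (connect(cs, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		goto out;

	/* Wait until the connection is ready to be accepted. */
	pfd.fd = ls;
	pfd.events = POLLIN;
	if (poll(&pfd, 1, 1000) != 1)
		goto out;

	as = accept(ls, NULL, NULL);
	if (as == -1)
		goto out;
	flags = fcntl(as, F_GETFL);
	if (flags != -1)
		ret = (flags & O_NONBLOCK) ? 1 : 0;
out:
	if (as != -1)
		close(as);
	if (cs != -1)
		close(cs);
	if (ls != -1)
		close(ls);
	return (ret);
}
```

Portable programs usually avoid relying on either behavior and set O_NONBLOCK explicitly on the accepted descriptor.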
# 31f555a1	18-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Assert socket buffer lock in sb_lock() to protect socket buffer sleep lock state. Convert tsleep() into msleep() with socket buffer mutex as argument. Hold socket buffer lock over sbunlock() to protect sleep lock state. Assert socket buffer lock in sbwait() to protect the socket buffer wait state. Convert tsleep() into msleep() with socket buffer mutex as argument. Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK() in order to call into these functions with the lock, as well as to start protecting other socket buffer use in their implementation. Drop the socket buffer mutexes around calls into the protocol layer, around potentially blocking operations, for copying to/from user space, and VM operations relating to zero-copy. Assert the socket buffer mutex strategically after code sections or at the beginning of loops. In some cases, modify return code to ensure locks are properly dropped. Convert the potentially blocking allocation of storage for the remote address in soreceive() into a non-blocking allocation; we may wish to move the allocation earlier so that it can block prior to acquisition of the socket buffer lock. Drop some spl use. NOTE: Some races exist in the current structuring of sosend() and soreceive(). This commit only merges basic socket locking in this code; follow-up commits will close additional races. As merged, these changes are not sufficient to run without Giant safely. Reviewed by: juli, tjr
# c0b99ffa	14-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	The socket field so_state is used to hold a variety of socket related flags relating to several aspects of socket functionality. This change breaks out several bits relating to send and receive operation into a new per-socket buffer field, sb_state, in order to facilitate locking. This is required because, in order to provide more granular locking of sockets, different state fields have different locking properties. The following fields are moved to sb_state: SS_CANTRCVMORE (so_state) SS_CANTSENDMORE (so_state) SS_RCVATMARK (so_state) Rename respectively to: SBS_CANTRCVMORE (so_rcv.sb_state) SBS_CANTSENDMORE (so_snd.sb_state) SBS_RCVATMARK (so_rcv.sb_state) This facilitates locking by isolating fields to be located with other identically locked fields, and permits greater granularity in socket locking by avoiding storing fields with different locking semantics in the same short (avoiding locking conflicts). In the future, we may wish to coalesce sb_state and sb_flags; for the time being I leave them separate and there is no additional memory overhead due to the packing/alignment of shorts in the socket buffer structure.
# 310e7ceb	12-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Socket MAC labels so_label and so_peerlabel are now protected by SOCK_LOCK(so): - Hold socket lock over calls to MAC entry points reading or manipulating socket labels. - Assert socket lock in MAC entry point implementations. - When externalizing the socket label, first make a thread-local copy while holding the socket lock, then release the socket lock to externalize to userspace.
# 3e87b34a	12-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Correct whitespace errors in merge from rwatson_netperf: tabs instead of spaces, no trailing tab at the end of line. Pointed out by: csjp
# 395a08c9	12-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Extend coverage of SOCK_LOCK(so) to include so_count, the socket reference count: - Assert SOCK_LOCK(so) macros that directly manipulate so_count: soref(), sorele(). - Assert SOCK_LOCK(so) in macros/functions that rely on the state of so_count: sofree(), sotryfree(). - Acquire SOCK_LOCK(so) before calling these functions or macros in various contexts in the stack, both at the socket and protocol layers. - In some cases, perform soisdisconnected() before sotryfree(), as this could result in frobbing of a non-present socket if sotryfree() actually frees the socket. - Note that sofree()/sotryfree() will release the socket lock even if they don't free the socket. Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS
# 1930e303	11-Jun-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Deorbit COMPAT_SUNOS. We inherited this from the sparc32 port of BSD4.4-Lite1. We have neither a sparc32 port nor a SunOS4.x compatibility desire these days.
# aa57bb04	07-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Correct a resource leak introduced in recent accept locking changes: when I reordered events in accept1() to allocate a file descriptor earlier, I didn't properly update use of goto on exit to unwind for cases where the file descriptor is now held, but wasn't previously. The result was that, in the event of accept() on a non-blocking socket, or in the event of a socket error, a file descriptor would be leaked. This ended up being non-fatal in many cases, as the file descriptor would be properly GC'd on process exit, so only showed up for processes that do a lot of non-blocking accept() calls, and also live for a long time (such as qmail). This change updates the use of goto targets to do additional unwinding. Eyes provided by: Brian Feldman <green@freebsd.org> Feet, hands provided by: Stefan Ehmann <shoesoft@gmx.net>, Dimitry Andric <dimitry@andric.com> Arjan van Leeuwen <avleeuwen@piwebs.com>
# 7a1a900c	07-Jun-2004	Hajimu UMEMOTO <ume@FreeBSD.org>	allow more than MLEN bytes for ancillary data to meet the requirement of Section 20.1 of RFC3542. Obtained from: KAME MFC after: 1 week
# 2658b3bb	01-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Integrate accept locking from rwatson_netperf, introducing a new global mutex, accept_mtx, which serializes access to the following fields across all sockets: so_qlen so_incqlen so_qstate so_comp so_incomp so_list so_head While providing only coarse granularity, this approach avoids lock order issues between sockets by avoiding ownership of the fields by a specific socket and its per-socket mutexes. While here, rewrite soclose(), sofree(), soaccept(), and sonewconn() to add assertions, close additional races and address lock order concerns. In particular: - Reorganize the optimistic concurrency behavior in accept1() to always allocate a file descriptor with falloc() so that if we do find a socket, we don't have to encounter the "Oh, there wasn't a socket" race that can occur if falloc() sleeps in the current code, which broke inbound accept() ordering, not to mention requiring backing out socket state changes in a way that raced with the protocol level. We may want to add a lockless read of the queue state if polling of empty queues proves to be important to optimize. - In accept1(), soref() the socket while holding the accept lock so that the socket cannot be free'd in a race with the protocol layer. Likewise in netgraph equivalents of the accept1() code. - In sonewconn(), loop waiting for the queue to be small enough to insert our new socket once we've committed to inserting it, or races can occur that cause the incomplete socket queue to overfill. In the previous implementation, it was sufficient to simply test once since calling soabort() didn't release synchronization permitting another thread to insert a socket as we discard a previous one.
- In soclose()/sofree()/et al, it is the responsibility of the caller to remove a socket from the incomplete connection queue before calling soabort(), which prevents soabort() from having to walk into the accept socket to release the socket from its queue, and avoids races when releasing the accept mutex to enter soabort(), permitting soabort() to avoid lock ordering issues with the caller. - Generally cluster accept queue related operations together throughout these functions in order to facilitate locking. Annotate new locking in socketvar.h.
# 36568179	31-May-2004	Robert Watson <rwatson@FreeBSD.org>	The SS_COMP and SS_INCOMP flags in the so_state field indicate whether the socket is on an accept queue of a listen socket. This change renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new state field on the socket, so_qstate, as the locking for these flags is substantially different for the locking on the remainder of the flags in so_state.
# 099a0e58	31-May-2004	Bosko Milekic <bmilekic@FreeBSD.org>	Bring in mbuma to replace mballoc. mbuma is an Mbuf & Cluster allocator built on top of a number of extensions to the UMA framework, all included herein. Extensions to UMA worth noting: - Better layering between slab <-> zone caches; introduce Keg structure which splits off slab cache away from the zone structure and allows multiple zones to be stacked on top of a single Keg (single type of slab cache); perhaps we should look into defining a subset API on top of the Keg for special use by malloc(9), for example. - UMA_ZONE_REFCNT zones can now be added, and reference counters automagically allocated for them within the end of the associated slab structures. uma_find_refcnt() does a kextract to fetch the slab struct reference from the underlying page, and lookup the corresponding refcnt. mbuma things worth noting: - integrates mbuf & cluster allocations with extended UMA and provides caches for commonly-allocated items; defines several zones (two primary, one secondary) and two kegs. - change up certain code paths that always used to do: m_get() + m_clget() to instead just use m_getcl() and try to take advantage of the newly defined secondary Packet zone. - netstat(1) and systat(1) quickly hacked up to do basic stat reporting but additional stats work needs to be done once some other details within UMA have been taken care of and it becomes clearer to how stats will work within the modified framework. From the user perspective, one implication is that the NMBCLUSTERS compile-time option is no longer used. The maximum number of clusters is still capped off according to maxusers, but it can be made unlimited by setting the kern.ipc.nmbclusters boot-time tunable to zero. Work should be done to write an appropriate sysctl handler allowing dynamic tuning of kern.ipc.nmbclusters at runtime. 
Additional things worth noting/known issues (READ): - One report of 'ips' (ServeRAID) driver acting really slow in conjunction with mbuma. Need more data. Latest report is that ips is equally sucking with and without mbuma. - Giant leak in NFS code sometimes occurs, can't reproduce but currently analyzing; brueffer is able to reproduce but THIS IS NOT an mbuma-specific problem and currently occurs even WITHOUT mbuma. - Issues in network locking: there is at least one code path in the rip code where one or more locks are acquired and we end up in m_prepend() with M_WAITOK, which causes WITNESS to whine from within UMA. Current temporary solution: force all UMA allocations to be M_NOWAIT from within UMA for now to avoid deadlocks unless WITNESS is defined and we can determine with certainty that we're not holding any locks when we're M_WAITOK. - I've seen at least one weird socketbuffer empty-but- mbuf-still-attached panic. I don't believe this to be related to mbuma but please keep your eyes open, turn on debugging, and capture crash dumps. This change removes more code than it adds. A paper is available detailing the change and considering various performance issues, it was presented at BSDCan2004: http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf Please read the paper for Future Work and implementation details, as well as credits. Testing and Debugging: rwatson, brueffer, Ketrien I. Saihr-Kesenchedra, ... Reviewed by: Lots of people (for different parts)
# f7250466	07-May-2004	Robert Watson <rwatson@FreeBSD.org>	Unconditionally lock Giant in do_sendfile(), rather than locking it conditional on debug.mpsafenet. We can try pushing down Giant here later, but we don't want to enter VFS without holding Giant. Bumped into by: kris
# 5a324893	05-May-2004	Alan Cox <alc@FreeBSD.org>	Make vm_page's PG_ZERO flag immutable between the time of the page's allocation and deallocation. This flag's principal use is shortly after allocation. For such cases, clearing the flag is pointless. The only unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed page. Reviewed by: tegge@
# e8410540	08-Apr-2004	Mike Silbersack <silby@FreeBSD.org>	Fix a regression in my change which sends headers along with data; a side effect of that change caused headers to not be sent if a 0 byte file was passed to sendfile. This change fixes that behavior, allowing sendfile to send out the headers even with a 0 byte file again. Noticed by: Dirk Engling
# 7f8a436f	05-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core
# 051bbf60	04-Apr-2004	Robert Watson <rwatson@FreeBSD.org>	Detatch incorrect spellings of detach.
# 121230a4	03-Apr-2004	Alan Cox <alc@FreeBSD.org>	In some cases, sf_buf_alloc() should sleep with pri PCATCH; in others, it should not. Add a new parameter so that the caller can specify which is the case. Reported by: dillon
# 627e4a99	28-Mar-2004	Robert Watson <rwatson@FreeBSD.org>	Conditionally acquire Giant when entering the sockets layer via the socket-specific system calls based on debug.mpsafenet, rather than acquiring Giant unconditionally.
# 74041f5a	28-Mar-2004	Robert Watson <rwatson@FreeBSD.org>	When validating that the length sum in recvit(), we fail to release Giant on an error. Add a Giant acquisition. Reviewed by: sam, bms
# 90ecfebd	16-Mar-2004	Alan Cox <alc@FreeBSD.org>	Refactor the existing machine-dependent sf_buf_free() into a machine-dependent function by the same name and a machine-independent function, sf_buf_mext(). Aside from the virtue of making more of the code machine-independent, this change also makes the interface more logical. Before, sf_buf_free() did more than simply undo an sf_buf_alloc(); it also unwired and if necessary freed the page. That is now the purpose of sf_buf_mext(). Thus, sf_buf_alloc() and sf_buf_free() can now be used as a general-purpose ephemeral map cache.
# 0b759971	03-Mar-2004	Robert Watson <rwatson@FreeBSD.org>	Remove unneeded label 'done2' from socket(). We now grab Giant only around socreate(), and don't need it for file descriptor accesses. Submitted by: sam
# b49d824e	08-Feb-2004	Mike Silbersack <silby@FreeBSD.org>	Add the SF_NODISKIO flag to sendfile. This flag causes sendfile to be mindful of blocking on disk I/O and instead return EBUSY when such blocking would occur. Results from the DeBox project indicate that blocking on disk I/O can slow the performance of a kqueue/poll based webserver. Using a flag such as SF_NODISKIO and throwing connections that would block to helper processes/threads helped increase performance. Currently, only the Flash webserver uses this flag, although it could probably be applied to thttpd with relative ease. Idea by: Yaoping Ruan & Vivek Pai
# ff5e43a3	04-Feb-2004	Mike Silbersack <silby@FreeBSD.org>	Rename iov_to_uio to uiofromiov to be more consistent with other uio* functions. Suggested by: bde
# beb699c7	01-Feb-2004	Mike Silbersack <silby@FreeBSD.org>	Rewrite sendfile's header support so that headers are now sent in the first packet along with data, instead of in their own packet. When serving files of size (packetsize - headersize) or smaller, this will result in one less packet crossing the network. Quick testing with thttpd and http_load has shown a noticeable performance improvement in this case (350 vs 330 fetches per second.) Included in this commit are two support routines, iov_to_uio, and m_uiotombuf; these routines are used by sendfile to construct the header mbuf chain that will be linked to the rest of the data in the socket buffer.
# 54556cc7	19-Jan-2004	Alexander Kabaev <kan@FreeBSD.org>	One more instance of magic number used in place of IO_SEQSHIFT. Submitted by: alc
# a2fe44e8	15-Jan-2004	Dag-Erling Smørgrav <des@FreeBSD.org>	New file descriptor allocation code, derived from similar code introduced in OpenBSD by Niels Provos. The patch introduces a bitmap of allocated file descriptors which is used to locate available descriptors when a new one is needed. It also moves the task of growing the file descriptor table out of fdalloc(), reducing complexity in both fdalloc() and do_dup(). Debts of gratitude are owed to tjr@ (who provided the original patch on which this work is based), grog@ (for the gdb(4) man page) and rwatson@ (for assistance with pxeboot(8)).
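The bitmap idea in this commit can be illustrated with a small userland sketch. This is not the kernel's fdalloc() code: the names `fd_bitmap`, `fd_bitmap_alloc`, `fd_bitmap_free`, and the fixed `FD_MAX` limit are assumptions made for illustration, and the real implementation also grows the descriptor table, which this sketch omits.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical fixed table size; the kernel grows its table instead. */
#define FD_MAX		256
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define NWORDS		((FD_MAX + WORD_BITS - 1) / WORD_BITS)

/* One bit per descriptor: 1 = allocated, 0 = free. */
struct fd_bitmap {
	unsigned long words[NWORDS];
};

/* Allocate the lowest free descriptor >= minfd, or return -1 if none. */
static int
fd_bitmap_alloc(struct fd_bitmap *map, int minfd)
{
	if (minfd < 0)
		minfd = 0;
	for (size_t fd = (size_t)minfd; fd < FD_MAX; fd++) {
		size_t word = fd / WORD_BITS;
		unsigned long bit = 1UL << (fd % WORD_BITS);

		/* Skip a fully allocated word in one step. */
		if (map->words[word] == ULONG_MAX) {
			fd = (word + 1) * WORD_BITS - 1;
			continue;
		}
		if ((map->words[word] & bit) == 0) {
			map->words[word] |= bit;
			return ((int)fd);
		}
	}
	return (-1);
}

/* Mark a descriptor as free again. */
static void
fd_bitmap_free(struct fd_bitmap *map, int fd)
{
	assert(fd >= 0 && fd < FD_MAX);
	map->words[fd / WORD_BITS] &= ~(1UL << (fd % WORD_BITS));
}
```

Because whole words can be compared against `ULONG_MAX`, a scan over a mostly full table touches one word per `WORD_BITS` descriptors rather than one table entry per descriptor.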
# d7a1c7e3	11-Jan-2004	Dag-Erling Smørgrav <des@FreeBSD.org>	Back out 1.166, which was committed by mistake.
# f1ea6d81	11-Jan-2004	Dag-Erling Smørgrav <des@FreeBSD.org>	Mechanical whitespace cleanup + other minor style nits.
# 012b5531	11-Jan-2004	Dag-Erling Smørgrav <des@FreeBSD.org>	Mechanical whitespace cleanup + minor style nits.
# d41457da	10-Jan-2004	Dag-Erling Smørgrav <des@FreeBSD.org>	More unparenthesized return values.
# b91a5997	10-Jan-2004	Dag-Erling Smørgrav <des@FreeBSD.org>	Style: parenthesize return values.
# 2b77864f	10-Jan-2004	Don Lewis <truckman@FreeBSD.org>	Add a somewhat redundant check on the len argument to getsockaddr() to avoid relying on the minimum memory allocation size to avoid problems. The check is somewhat redundant because the consumers of the returned structure will check that sa_len is a protocol-specific larger size. Submitted by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: nectar MFC after: 30 days
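A length check of the kind this commit describes might look like the sketch below. The function name `sk_check_addrlen` and the macro `SK_MAXADDRLEN` are hypothetical, and the exact bounds and error codes are assumptions rather than a copy of getsockaddr(); the upper bound of 255 mirrors FreeBSD's `SOCK_MAXADDRLEN`.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Assumed upper bound; FreeBSD defines SOCK_MAXADDRLEN as 255. */
#define SK_MAXADDRLEN	255

/*
 * Validate a caller-supplied sockaddr length before allocating or
 * copying anything.  Returns 0 if acceptable, otherwise an errno value.
 */
static int
sk_check_addrlen(size_t len)
{
	if (len > SK_MAXADDRLEN)
		return (ENAMETOOLONG);
	/* Too short to even hold the fixed header before sa_data. */
	if (len < offsetof(struct sockaddr, sa_data))
		return (EINVAL);
	return (0);
}
```

Rejecting lengths shorter than the fixed header means later code can read the family field safely without depending on how much memory the allocator happens to hand back.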
# ddeb5b24	28-Dec-2003	Mike Silbersack <silby@FreeBSD.org>	Track three new sendfile-related statistics: - The number of times sendfile had to do disk I/O - The number of times sfbuf allocation failed - The number of times sfbuf allocation had to wait
# 93220782	25-Dec-2003	David Malone <dwmalone@FreeBSD.org>	In socket(2) we only need Giant around the call to socreate, so just grab it there.
# 9f144cff	24-Dec-2003	Alfred Perlstein <alfred@FreeBSD.org>	Add restrict qualifiers. PR: 44394 Submitted by: Craig Rodrigues <rodrige@attbi.com>
# 186e347f	01-Dec-2003	David Greenman <dg@FreeBSD.org>	Fixed a bug in sendfile(2) where the sent data would be corrupted due to sendfile(2) being erroneously automatically restarted after a signal is delivered. Fixed by converting ERESTART to EINTR prior to exiting. Updated manual page to indicate the potential EINTR error, its cause and consequences. Approved by: re@freebsd.org
# e45db9b8	15-Nov-2003	Alan Cox <alc@FreeBSD.org>	- Modify alpha's sf_buf implementation to use the direct virtual-to-physical mapping. - Move the sf_buf API to its own header file; make struct sf_buf's definition machine dependent. In this commit, we remove an unnecessary field from struct sf_buf on the alpha, amd64, and ia64. Ultimately, we may eliminate struct sf_buf on those architectures except as an opaque pointer that references a vm page.
# e1419c08	19-Oct-2003	David Malone <dwmalone@FreeBSD.org>	falloc allocates a file structure and adds it to the file descriptor table, acquiring the necessary locks as it works. It usually returns two references to the new descriptor: one in the descriptor table and one via a pointer argument. As falloc releases the FILEDESC lock before returning, there is a potential for a process to close the reference in the file descriptor table before falloc's caller gets to use the file. I don't think this can happen in practice at the moment, because Giant indirectly protects closes. To stop the file being completely closed in this situation, this change makes falloc set the refcount to two when both references are returned. This makes life easier for several of falloc's callers, because the first thing they previously did was grab an extra reference on the file. Reviewed by: iedowse Idea run past: jhb
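The two-reference scheme described here can be sketched in userland as follows. All names (`sk_file`, `sk_table`, `sk_falloc`, `sk_fdrop`, `sk_close`) are hypothetical stand-ins, and locking is omitted entirely; the sketch only shows why starting the count at two keeps the caller's pointer valid across a concurrent close.

```c
#include <assert.h>
#include <stdlib.h>

#define SK_NFILES	8	/* hypothetical table size */

struct sk_file {
	int f_count;		/* number of references to this file */
};

struct sk_table {
	struct sk_file *slot[SK_NFILES];
};

/*
 * Allocate a file and install it in the first free slot.  If resultfp is
 * non-NULL the caller also receives a pointer, so the count starts at 2:
 * one reference for the table, one for the caller.
 */
static int
sk_falloc(struct sk_table *t, struct sk_file **resultfp)
{
	for (int fd = 0; fd < SK_NFILES; fd++) {
		if (t->slot[fd] != NULL)
			continue;
		struct sk_file *fp = calloc(1, sizeof(*fp));
		if (fp == NULL)
			return (-1);
		fp->f_count = (resultfp != NULL) ? 2 : 1;
		t->slot[fd] = fp;
		if (resultfp != NULL)
			*resultfp = fp;
		return (fd);
	}
	return (-1);
}

/* Drop one reference; returns 1 if the file was freed, 0 otherwise. */
static int
sk_fdrop(struct sk_file *fp)
{
	if (--fp->f_count > 0)
		return (0);
	free(fp);
	return (1);
}

/* Remove the table's reference, as close(2) would. */
static int
sk_close(struct sk_table *t, int fd)
{
	assert(fd >= 0 && fd < SK_NFILES && t->slot[fd] != NULL);
	struct sk_file *fp = t->slot[fd];
	t->slot[fd] = NULL;
	return (sk_fdrop(fp));
}
```

With this shape, a close that races with the allocating caller only drops the table's reference; the structure is freed when the caller performs its own final `sk_fdrop()`.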
# 411d10a6	29-Aug-2003	Alan Cox <alc@FreeBSD.org>	Migrate the sf_buf allocator that is used by sendfile(2) and zero-copy sockets into machine-dependent files. The rationale for this migration is illustrated by the modified amd64 allocator. It uses the amd64's direct map to avoid ephemeral mappings in the kernel's address space. On an SMP, the ephemeral mappings result in an IPI for TLB shootdown for each transmitted page. Yuck. Maintainers of other 64-bit platforms with direct maps should be able to use the amd64 allocator as a reference implementation.
# 660ebf0e	11-Aug-2003	Alexander Kabaev <kan@FreeBSD.org>	Drop Giant in recvit before returning an error to the caller to avoid leaking the Giant on the syscall exit.
# b81694ed	06-Aug-2003	Yaroslav Tykhiy <ytykhiy@gmail.com>	If connect(2) has been interrupted by a signal and therefore the connection is to be established asynchronously, behave as in the case of non-blocking mode: - keep the SS_ISCONNECTING bit set thus indicating that the connection establishment is in progress, which is the case (clearing the bit in this case was just a bug); - return EALREADY, instead of the confusing and unreasonable EADDRINUSE, upon further connect(2) attempts on this socket until the connection is established (this also brings our connect(2) into accord with IEEE Std 1003.1.)
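From the application side, an interrupted connect(2) is usually completed by waiting for writability and then reading the pending socket error, rather than by calling connect(2) again. The sketch below shows that conventional POSIX pattern; the helper name `connect_resumable` and its timeout parameter are hypothetical and not part of this commit.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/*
 * Connect, treating EINTR like the non-blocking case: the connection
 * continues asynchronously, so wait for it to finish and fetch the
 * result with SO_ERROR.  Returns 0 on success, -1 with errno set.
 */
static int
connect_resumable(int s, const struct sockaddr *sa, socklen_t len,
    int timeout_ms)
{
	struct pollfd pfd = { .fd = s, .events = POLLOUT };
	int n, soerr = 0;
	socklen_t slen = sizeof(soerr);

	if (connect(s, sa, len) == 0)
		return (0);
	if (errno != EINTR && errno != EINPROGRESS && errno != EALREADY)
		return (-1);

	/* Wait until the socket becomes writable or the timeout expires. */
	do {
		n = poll(&pfd, 1, timeout_ms);
	} while (n == -1 && errno == EINTR);
	if (n == -1)
		return (-1);
	if (n == 0) {
		errno = ETIMEDOUT;
		return (-1);
	}

	/* The outcome of the asynchronous connect is reported here. */
	if (getsockopt(s, SOL_SOCKET, SO_ERROR, &soerr, &slen) == -1)
		return (-1);
	if (soerr != 0) {
		errno = soerr;
		return (-1);
	}
	return (0);
}
```

Treating EALREADY the same way as EINPROGRESS lets a caller that retried connect(2) after a signal fall through to the same wait-and-check path.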
# d2cce3d6	04-Aug-2003	David Malone <dwmalone@FreeBSD.org>	Do some minor Giant pushdown made possible by copyin, fget, fdrop, malloc and mbuf allocation all not requiring Giant. 1) ostat, fstat and nfstat don't need Giant until they call fo_stat. 2) accept can copyin the address length without grabbing Giant. 3) sendit doesn't need Giant, so don't bother grabbing it until kern_sendit. 4) move Giant grabbing from each individual recv* syscall to recvit.
# efd02757	01-Aug-2003	Alan Cox <alc@FreeBSD.org>	Use kmem_alloc_nofault() rather than kmem_alloc_pageable() in sf_buf_init(). (See revision 1.140 of kern/sys_pipe.c for a detailed rationale.) Submitted by: tegge
# 8d5f9131	18-Jun-2003	Don Lewis <truckman@FreeBSD.org>	VOP_GETVOBJECT() wants to be called with the vnode lock held.
# c10c5378	11-Jun-2003	Alan Cox <alc@FreeBSD.org>	Finish the vm object locking in sendfile(2). More generally, the vm locking in sendfile(2) is complete.
# 2ab3670a	11-Jun-2003	Alan Cox <alc@FreeBSD.org>	Lock the vm object when removing a page.
# 677b542e	10-Jun-2003	David E. O'Brien <obrien@FreeBSD.org>	Use __FBSDID().
# de1cab2b	29-May-2003	David Malone <dwmalone@FreeBSD.org>	Grab giant in sendit rather than kern_sendit because sockargs may allocate mbufs with M_TRYWAIT, which may require Giant. Reviewed by: bmilekic Approved by: re (scottl)
# 710c5645	05-May-2003	David Malone <dwmalone@FreeBSD.org>	Split sendit into two parts. The first part, still called sendit, does the copyin stuff and then calls the second part, kern_sendit, to do the hard work. Don't bother holding Giant during the copyin phase. The intent of this is to allow the Linux emulator to implement send* syscalls without using the stackgap.
# 7be80f55	30-Mar-2003	Alan Cox <alc@FreeBSD.org>	Recent changes to uipc_cow.c have eliminated the need for some sf_buf- related variables to be global. Make them either local to sf_buf_init() or static.
# 9f6d45b1	28-Mar-2003	Alan Cox <alc@FreeBSD.org>	Pass the vm_page's address to sf_buf_alloc(); map the vm_page as part of sf_buf_alloc() instead of expecting sf_buf_alloc()'s caller to map it. The ultimate reason for this change is to enable two optimizations: (1) that there never be more than one sf_buf mapping a vm_page at a time and (2) 64-bit architectures can transparently use their 1-1 virtual to physical mapping (e.g., "K0SEG") avoiding the overhead of pmap_qenter() and pmap_qremove().
# 42de97a5	16-Mar-2003	Alan Cox <alc@FreeBSD.org>	Pass the sf buf to MEXTADD() as the optional argument. This permits the simplification of socow_iodone() and sf_buf_free(); they don't have to reverse engineer the sf buf from the data's address.
# 7c4351aa	05-Mar-2003	Alan Cox <alc@FreeBSD.org>	Remove GIANT_REQUIRED from sf_buf_free().
# 6a07a139	23-Feb-2003	Tor Egge <tegge@FreeBSD.org>	Sync new socket nonblocking/async state with file flags in accept(). PR: 1775 Reviewed by: mbr
# d6bf2378	19-Feb-2003	Olivier Houchard <cognet@FreeBSD.org>	Remove duplicate includes. Submitted by: Cyril Nguyen-Huu <cyril@ci0.org>
# a163d034	18-Feb-2003	Warner Losh <imp@FreeBSD.org>	Back out M_* changes, per decision of the TRB. Approved by: trb
# 12e4397e	03-Feb-2003	Hajimu UMEMOTO <ume@FreeBSD.org>	Break out the bind and connect syscalls to make calling these syscalls internally easy. This is preparation for forthcoming IPv6 support for Linuxlator. Submitted by: dwmalone MFC after: 10 days
# 8deebb01	02-Feb-2003	Alfred Perlstein <alfred@FreeBSD.org>	Consolidate MIN/MAX macros into one place (param.h). Submitted by: Hiten Pandya <hiten@unixdaemons.com>
# 44956c98	21-Jan-2003	Alfred Perlstein <alfred@FreeBSD.org>	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
# 48e3128b	12-Jan-2003	Matthew Dillon <dillon@FreeBSD.org>	Bow to the whining masses and change a union back into void *. Retain removal of unnecessary casts and throw in some minor cleanups to see if anyone complains, just for the hell of it.
# cd72f218	11-Jan-2003	Matthew Dillon <dillon@FreeBSD.org>	Change struct file f_data to un_data, a union of the correct struct pointer types, and remove a huge number of casts from code using it. Change struct xfile xf_data to xun_data (ABI is still compatible). If we need to add a #define for f_data and xf_data we can, but I don't think it will be necessary. There are no operational changes in this commit.
# 08c7670a	23-Dec-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Move the declaration of the socket fileops from socketvar.h to file.h. This allows us to use the new typedefs and removes the needs for a number of forward struct declarations in socketvar.h
# b371c939	06-Oct-2002	Robert Watson <rwatson@FreeBSD.org>	Integrate mac_check_socket_send() and mac_check_socket_receive() checks from the MAC tree: allow policies to perform access control for the ability of a process to send and receive data via a socket. At some point, we might also pass in additional address information if an explicit address is requested on send. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 91e97a82	02-Oct-2002	Don Lewis <truckman@FreeBSD.org>	In an SMP environment post-Giant it is no longer safe to blindly dereference the struct sigio pointer without any locking. Change fgetown() to take a reference to the pointer instead of a copy of the pointer and call SIGIO_LOCK() before copying the pointer and dereferencing it. Reviewed by: rwatson
# f2f03122	28-Aug-2002	Archie Cobbs <archie@FreeBSD.org>	accept(2) on a socket that has been shutdown(2) normally returns ECONNABORTED. Make this happen in the non-blocking case as well. The previous behavior was to return EAGAIN, which (a) is not consistent with the blocking case and (b) causes the application to think the socket is still valid. PR: bin/42100 Reviewed by: freebsd-net MFC after: 3 days
# 9ca43589	15-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	In order to better support flexible and extensible access control, make a series of modifications to the credential arguments relating to file read and write operations to clarify which credential is used for what: - Change fo_read() and fo_write() to accept "active_cred" instead of "cred", and change the semantics of consumers of fo_read() and fo_write() to pass the active credential of the thread requesting an operation rather than the cached file cred. The cached file cred is still available in fo_read() and fo_write() consumers via fp->f_cred. These changes largely in sys_generic.c. For each implementation of fo_read() and fo_write(), update cred usage to reflect this change and maintain current semantics: - badfo_readwrite() unchanged - kqueue_read/write() unchanged - pipe_read/write() now authorize MAC using active_cred rather than td->td_ucred - soo_read/write() unchanged - vn_read/write() now authorize MAC using active_cred but VOP_READ/WRITE() with fp->f_cred Modify vn_rdwr() to accept two credential arguments instead of a single credential: active_cred and file_cred. Use active_cred for MAC authorization, and select a credential for use in VOP_READ/WRITE() based on whether file_cred is NULL or not. If file_cred is provided, authorize the VOP using that cred, otherwise the active credential, matching current semantics. Modify current vn_rdwr() consumers to pass a file_cred if used in the context of a struct file, and to always pass active_cred. When vn_rdwr() is used without a file_cred, pass NOCRED. These changes should maintain current semantics for read/write, but avoid a redundant passing of fp->f_cred, as well as making it more clear what the origin of each credential is in file descriptor read/write operations.
Follow-up commits will make similar changes to other file descriptor operations, and modify the MAC framework to pass both credentials to MAC policy modules so they can implement either semantic for revocation. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 4b9c2fa1	15-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Fix return case for negative namelen by jumping to normal exit processing rather than immediately returning, or we may not unlock necessary locks. Noticed by: Mike Heffner <mheffner@acm.vt.edu>
# 9e63574e	13-Aug-2002	David Greenman <dg@FreeBSD.org>	Moved sf_buf_alloc and sf_buf_free function declarations to sys/socketvar.h so that they can be seen by external callers.
# a370c700	13-Aug-2002	David Greenman <dg@FreeBSD.org>	Remove obsolete comment about sf_buf_* functions being static. They were made un-static in rev 1.114.
# 87df4f8f	11-Aug-2002	Semen Ustimenko <semenu@FreeBSD.org>	Fix sendfile(), which was calling vn_rdwr() without the aresid parameter and thus hitting EIO at the end of file. This is believed to be a feature (not a bug) of vn_rdwr(), so we turn it off by supplying the aresid param. Reviewed by: rwatson, dg
# 5b770403	08-Aug-2002	Jacques Vidrine <nectar@FreeBSD.org>	While we're at it, add range checks similar to those in previous commit to getsockname() and getpeername(), too.
# 82d9ad33	08-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Add additional range checks for copyout targets. Submitted by: Silvio Cesare <silvio@qualys.com>
# f9d0d524	01-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Include file cleanup; mac.h and malloc.h at one point had ordering relationship requirements, and no longer do. Reminded by: bde
# 62f5f684	31-Jul-2002	Robert Watson <rwatson@FreeBSD.org>	Introduce support for Mandatory Access Control and extensible kernel access control. Instrument connect(), listen(), and bind() system calls to invoke MAC framework entry points to permit policies to authorize these requests. This can be useful for policies that want to limit the activity of processes involving particular types of IPC and network activity. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 1161b86a	30-Jul-2002	Alan Cox <alc@FreeBSD.org>	o In do_sendfile(), replace vm_page_sleep_busy() by vm_page_sleep_if_busy() and extend the scope of the page queues lock to cover all accesses to the page's flags and busy fields.
# 5d323204	22-Jul-2002	Andrew R. Reiter <arr@FreeBSD.org>	- Make use of the VM_ALLOC_WIRED flag in the call to vm_page_alloc() in do_sendfile(). This allows us to rearrange an if statement in order to avoid doing an unnecessary call to vm_page_lock_queues(), and an attempt at re-wiring the pages (which were wired in the vm_page_alloc() call). Reviewed by: alc, jhb
# ae0ffa73	12-Jul-2002	Alan Cox <alc@FreeBSD.org>	Lock accesses to the page queues by sendfile() and friends.
# 9c341296	12-Jul-2002	Alfred Perlstein <alfred@FreeBSD.org>	Create a bug-for-bug FreeBSD4 compatible version of sendfile and move the fixed sendfile over. This is needed to preserve binary compatibility from 4.x to 5.x.
# a551e20e	28-Jun-2002	Alfred Perlstein <alfred@FreeBSD.org>	nuke more instances of caddr_t
# 64f0b9d7	28-Jun-2002	Alfred Perlstein <alfred@FreeBSD.org>	remove or replace caddr_t with void. make the mbuf external free function take a void * rather than caddr_t.
# 98cb733c	25-Jun-2002	Kenneth D. Merry <ken@FreeBSD.org>	At long last, commit the zero copy sockets code. MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes. ti.4: Update the ti(4) man page to include information on the TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options, and also include information about the new character device interface and the associated ioctls. man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated links. jumbo.9: New man page describing the jumbo buffer allocator interface and operation. zero_copy.9: New man page describing the general characteristics of the zero copy send and receive code, and what an application author should do to take advantage of the zero copy functionality. NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS, TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT. conf/files: Add uipc_jumbo.c and uipc_cow.c. conf/options: Add the 5 options mentioned above. kern_subr.c: Receive side zero copy implementation. This takes "disposable" pages attached to an mbuf, gives them to a user process, and then recycles the user's page. This is only active when ZERO_COPY_SOCKETS is turned on and the kern.ipc.zero_copy.receive sysctl variable is set to 1. uipc_cow.c: Send side zero copy functions. Takes a page written by the user and maps it copy on write and assigns it kernel virtual address space. Removes copy on write mapping once the buffer has been freed by the network stack. uipc_jumbo.c: Jumbo disposable page allocator code. This allocates (optionally) disposable pages for network drivers that want to give the user the option of doing zero copy receive. uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are enabled if ZERO_COPY_SOCKETS is turned on. Add zero copy send support to sosend() -- pages get mapped into the kernel instead of getting copied if they meet size and alignment restrictions. uipc_syscalls.c:Un-staticize some of the sf* functions so that they can be used elsewhere. 
(uipc_cow.c) if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid calling malloc() with M_WAITOK. Return an error if the M_NOWAIT malloc fails. The ti(4) driver and the wi(4) driver, at least, call this with a mutex held. This causes witness warnings for 'ifconfig -a' with a wi(4) or ti(4) board in the system. (I've only verified for ti(4)). ip_output.c: Fragment large datagrams so that each segment contains a multiple of PAGE_SIZE amount of data plus headers. This allows the receiver to potentially do page flipping on receives. if_ti.c: Add zero copy receive support to the ti(4) driver. If TI_PRIVATE_JUMBOS is not defined, it now uses the jumbo(9) buffer allocator for jumbo receive buffers. Add a new character device interface for the ti(4) driver for the new debugging interface. This allows (a patched version of) gdb to talk to the Tigon board and debug the firmware. There are also a few additional debugging ioctls available through this interface. Add header splitting support to the ti(4) driver. Tweak some of the default interrupt coalescing parameters to more useful defaults. Add hooks for supporting transmit flow control, but leave it turned off with a comment describing why it is turned off. if_tireg.h: Change the firmware rev to 12.4.11, since we're really at 12.4.11 plus fixes from 12.4.13. Add defines needed for debugging. Remove the ti_stats structure, it is now defined in sys/tiio.h. ti_fw.h: 12.4.11 firmware. ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13, and my header splitting patches. Revision 12.4.13 doesn't handle 10/100 negotiation properly. (This firmware is the same as what was in the tree previously, with the addition of header splitting support.) sys/jumbo.h: Jumbo buffer allocator interface. sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to indicate that the payload buffer can be thrown away / flipped to a userland process. socketvar.h: Add prototype for socow_setup. 
tiio.h: ioctl interface to the character portion of the ti(4) driver, plus associated structure/type definitions. uio.h: Change prototype for uiomoveco() so that we'll know whether the source page is disposable. ufs_readwrite.c:Update for new prototype of uiomoveco(). vm_fault.c: In vm_fault(), check to see whether we need to do a page based copy on write fault. vm_object.c: Add a new function, vm_object_allocate_wait(). This does the same thing that vm_object_allocate() does, except that it gives the caller the opportunity to specify whether it should wait on the uma_zalloc() of the object structure. This allows vm objects to be allocated while holding a mutex. (Without generating WITNESS warnings.) vm_object_allocate() is implemented as a call to vm_object_allocate_wait() with the malloc flag set to M_WAITOK. vm_object.h: Add prototype for vm_object_allocate_wait(). vm_page.c: Add page-based copy on write setup, clear and fault routines. vm_page.h: Add page based COW function prototypes and variable in the vm_page structure. Many thanks to Drew Gallatin, who wrote the zero copy send and receive code, and to all the other folks who have tested and reviewed this code over the years.
# c33c8251	20-Jun-2002	Alfred Perlstein <alfred@FreeBSD.org>	Implement SO_NOSIGPIPE option for sockets. This allows one to request that an EPIPE error return not generate SIGPIPE on sockets. Submitted by: lioux Inspired by: Darwin
# 60a9bb19	06-Jun-2002	John Baldwin <jhb@FreeBSD.org>	Catch up to changes in ktrace API.
# 4cc20ab1	31-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Back out my last commit of locking down a socket, it conflicts with hsu's work. Requested by: hsu
# 243917fe	19-May-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Lock down a socket, milestone 1. o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket. o Determine the lock strategy for each members in struct socket. o Lock down the following members: - so_count - so_options - so_linger - so_state o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket: - sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup() Reviewed by: alfred
# 89e9e6e7	19-Apr-2002	Robert Watson <rwatson@FreeBSD.org>	In sendfile(), use the vn_rdwr() helper function, rather than manually constructing a struct aio and invoking VOP_READ() directly. This cleans up the code a little, but also has the advantage of making sure almost all vnode read/write access in the kernel goes through the helper function, meaning that instrumentation of that helper function can impact almost all relevant read/write operations. In this case, it permits us to put MAC hooks into vn_rdwr() and not modify uipc_syscalls.c (yet). In general, if helper vn_*() functions exist, they should be used in preference to direct VOP's in system call service code. Submitted by: green Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 6008862b	04-Apr-2002	John Baldwin <jhb@FreeBSD.org>	Change callers of mtx_init() to pass in an appropriate lock type name. In most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used. Tested on: i386, alpha, sparc64
# 70f52b48	23-Mar-2002	Bruce Evans <bde@FreeBSD.org>	Fixed some style bugs in the removal of __P(()). The main ones were not removing tabs before "__P((", and not outdenting continuation lines to preserve non-KNF lining up of code with parentheses. Switch to KNF formatting and/or rewrap the whole prototype in some cases.
# 4d77a549	19-Mar-2002	Alfred Perlstein <alfred@FreeBSD.org>	Remove __P.
# a854ed98	27-Feb-2002	John Baldwin <jhb@FreeBSD.org>	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.
# 7228268a	22-Jan-2002	David Greenman <dg@FreeBSD.org>	Fixed bug in calculation of amount of file to send when nbytes !=0 and headers or trailers are supplied. Reported by Vladislav Shabanov <vs@rambler-co.ru>. PR: 33771 Submitted by: Maxim Konovalov <maxim@macomnet.ru> MFC after: 3 days
# 426da3bc	13-Jan-2002	Alfred Perlstein <alfred@FreeBSD.org>	SMP Lock struct file, filedesc and the global file list. Seigo Tanimura (tanimura) posted the initial delta. I've polished it quite a bit reducing the need for locking and adapting it for KSE. Locks: 1 mutex in each filedesc protects all the fields. protects "struct file" initialization, while a struct file is being changed from &badfileops -> &pipeops or something the filedesc should be locked. 1 mutex in each struct file protects the refcount fields. doesn't protect anything else. the flags used for garbage collection have been moved to f_gcflag which was the FILLER short, this doesn't need locking because the garbage collection is a single threaded container. could likely be made to use a pool mutex. 1 sx lock for the global filelist. struct file *fhold(struct file *fp); /* increments reference count on a file */ struct file *fhold_locked(struct file *fp); /* like fhold but expects file to be locked */ struct file *ffind_hold(struct thread *, int fd); /* finds the struct file in thread, adds one reference and returns it unlocked */ struct file *ffind_lock(struct thread *, int fd); /* ffind_hold, but returns file locked */ I still have to smp-safe the fget cruft, I'll get to that asap.
# 078a4e89	08-Jan-2002	Alfred Perlstein <alfred@FreeBSD.org>	Sockets are called 'so' not 'sp'.
# 9c4d63da	31-Dec-2001	Robert Watson <rwatson@FreeBSD.org>	o Make the credential used by socreate() an explicit argument to socreate(), rather than getting it implicitly from the thread argument. o Make NFS cache the credential provided at mount-time, and use the cached credential (nfsmount->nm_cred) when making calls to socreate() on initially connecting, or reconnecting the socket. This fixes bugs involving NFS over TCP and ipfw uid/gid rules, as well as bugs involving NFS and mandatory access control implementations. Reviewed by: freebsd-arch
# b1e4abd2	16-Nov-2001	Matthew Dillon <dillon@FreeBSD.org>	Give struct socket structures a ref counting interface similar to vnodes. This will hopefully serve as a base from which we can expand the MP code. We currently do not attempt to obtain any mutex or SX locks, but the door is open to add them when we nail down exactly how that part of it is going to work.
# b064d43d	13-Nov-2001	Matthew Dillon <dillon@FreeBSD.org>	remove holdfp() Replace uses of holdfp() with fget() or fgetvp() calls as appropriate introduce fget(), fget_read(), fget_write() - these functions will take a thread and file descriptor and return a file pointer with its ref count bumped. introduce fgetvp(), fgetvp_read(), fgetvp_write() - these functions will take a thread and file descriptor and return a vref()'d vnode. _read() requires that the file pointer be FREAD, _write that it be FWRITE. This continues the cleanup of struct filedesc and struct file access routines which, when are all through with it, will allow us to then make the API calls MP safe and be able to move Giant down into the fo_* functions.
# b40ce416	12-Sep-2001	Julian Elischer <julian@FreeBSD.org>	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# df998760	30-Aug-2001	Matthew Dillon <dillon@FreeBSD.org>	Giant pushdown syscalls in kern/uipc_syscalls.c. Affected calls: recvmsg(), sendmsg(), recvfrom(), accept(), getpeername(), getsockname(), socket(), connect(), accept(), send(), recv(), bind(), setsockopt(), listen(), sendto(), shutdown(), socketpair(), sendfile()
# 0cddd8f0	04-Jul-2001	Matthew Dillon <dillon@FreeBSD.org>	With Alfred's permission, remove vm_mtx in favor of a fine-grained approach (this commit is just the first stage). Also add various GIANT_ macros to formalize the removal of Giant, making it easy to test in a more piecemeal fashion. These macros will allow us to test fine-grained locks to a degree before removing Giant, and also after, and to remove Giant in a piecemeal fashion via sysctl's on those subsystems which the authors believe can operate without Giant.
# db3cc2d0	23-Jun-2001	David Malone <dwmalone@FreeBSD.org>	Don't dereference a NULL pointer if we fail to get a sendfilebuf.
# 9d127f9f	25-May-2001	John Baldwin <jhb@FreeBSD.org>	Add vm locking to sendfile(2) and sf_buf_free(). Reported by: Tamiji Homma <thomma@BayNetworks.com> Tested by: Tamiji Homma <thomma@BayNetworks.com>
# fb919e4d	01-May-2001	Mark Murray <markm@FreeBSD.org>	Undo part of the tangle of having sys/lock.h and sys/mutex.h included in other "system" header files. Also help the deprecation of lockmgr.h by making it a sub-include of sys/lock.h and removing sys/lockmgr.h from kernel .c files. Sort sys/*.h includes where possible in affected files. OK'ed by: bde (with reservations)
# 60fb0ce3	28-Apr-2001	Greg Lehey <grog@FreeBSD.org>	Revert consequences of changes to mount.h, part 2. Requested by: bde
# 06336fb2	25-Apr-2001	Alfred Perlstein <alfred@FreeBSD.org>	Sendfile is documented to return 0 on success, however when a sf_hdtr is used to provide writev(2) style headers/trailers on the sent data the return value is actually either the result of writev(2) from the trailers or headers if no trailers are specified. Fix sendfile to comply with the documentation, by returning 0 on success. Ok'd by: dg
# d98dc34f	23-Apr-2001	Greg Lehey <grog@FreeBSD.org>	Correct #includes to work with fixed sys/mount.h.
# 4bde2ac5	08-Mar-2001	Bosko Milekic <bmilekic@FreeBSD.org>	Fix a similar race condition to the one that existed in the mbuf code. When we go into an interruptable sleep and we increment a sleep count, we make sure that we are the thread that will decrement the count when we wakeup. Otherwise, what happens is that if we get interrupted (signal) and we have to wake up, but before we get our mutex, some thread that wants to wake us up detects that the count is non-zero and so enters wakeup_one(), but there's nothing on the sleep queue and so we don't get woken up. The thread will still decrement the sleep count, which is bad because we will also decrement it again later (as we got interrupted) and are already off the sleep queue.
# 2239c07d	08-Mar-2001	David Malone <dwmalone@FreeBSD.org>	Make the wait for sendfile buffers interruptable. Stops one process consuming them all and then getting stuck. Reviewed by: dg Reviewed by: bmilekic Observed by: Andreas Persson <pap@garen.net>
# 19eb87d2	06-Mar-2001	John Baldwin <jhb@FreeBSD.org>	Grab the process lock while calling psignal and before calling psignal.
# 2fd7d53d	13-Feb-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Return ECONNABORTED from accept if connection is closed while on the listen queue, as well as the current behavior of a zero-length sockaddr. Obtained from: KAME Reviewed by: -net
# 9ed346ba	08-Feb-2001	Bosko Milekic <bmilekic@FreeBSD.org>	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarly, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled. Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)
# 1550c317	02-Jan-2001	Poul-Henning Kamp <phk@FreeBSD.org>	Fix the <sys/queue.h> abuse. Submitted by: Dima Dorfman <dima@unixfreak.org> Reviewed by: /sbin/md5
# 7f9cb018	02-Jan-2001	Poul-Henning Kamp <phk@FreeBSD.org>	Add an XXX about a <sys/queue.h> transgression which needs cleaned up.
# 2a0c503e	21-Dec-2000	Bosko Milekic <bmilekic@FreeBSD.org>	* Rename M_WAIT mbuf subsystem flag to M_TRYWAIT. This is because calls with M_WAIT (now M_TRYWAIT) may not wait forever when nothing is available for allocation, and may end up returning NULL. Hopefully we now communicate more of the right thing to developers and make it very clear that it's necessary to check whether calls with M_(TRY)WAIT also resulted in a failed allocation. M_TRYWAIT basically means "try harder, block if necessary, but don't necessarily wait forever." The time spent blocking is tunable with the kern.ipc.mbuf_wait sysctl. M_WAIT is now deprecated but still defined for the next little while. * Fix a typo in a comment in mbuf.h * Fix some code that was actually passing the mbuf subsystem's M_WAIT to malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the value of the M_WAIT flag, this could have become a big problem.
# 7cc0979f	08-Dec-2000	David Malone <dwmalone@FreeBSD.org>	Convert more malloc+bzero to malloc+M_ZERO. Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>
# 8f9a5273	02-Dec-2000	David Greenman <dg@FreeBSD.org>	Changed second argument in a call to sf_buf_free() to be NULL instead of PAGE_SIZE to match the prototype better. The argument is ignored, so this is just to silence the compile-time warning. Pointed out by: jhb
# 794cd879	01-Dec-2000	Bosko Milekic <bmilekic@FreeBSD.org>	Make sure to free the sf_buf if we've allocated it but fail to allocate an mbuf (ENOBUFS) before returning so that we don't leak sf_bufs in the case where we're out of mbufs. Submitted by: David Greenman (dg)
# 279d7226	18-Nov-2000	Matthew Dillon <dillon@FreeBSD.org>	This patchset fixes a large number of file descriptor race conditions. Pre-rfork code assumed inherent locking of a process's file descriptor array. However, with the advent of rfork() the file descriptor table could be shared between processes. This patch closes over a dozen serious race conditions related to one thread manipulating the table (e.g. closing or dup()ing a descriptor) while another is blocked in an open(), close(), fcntl(), read(), write(), etc... PR: kern/11629 Discussed with: Alexander Viro <viro@math.psu.edu>
# 866746b6	12-Nov-2000	David Greenman <dg@FreeBSD.org>	Fixed a certain panic on IO error in sendfile(): Page must be set PG_BUSY before calling vm_page_free() on it.
# e7789181	11-Nov-2000	Bosko Milekic <bmilekic@FreeBSD.org>	* Have m_pulldown() use the new M_WRITABLE() macro in order to determine whether the given ext_buf is shared. * Have the sf_bufs be setup with the mbuf subsystem using MEXTADD() with the two new arguments. Note: m_pulldown() is somewhat crotchy; the added comment explains the situation. Reviewed by: jlemon
# fe27eea9	04-Nov-2000	Bosko Milekic <bmilekic@FreeBSD.org>	Change the sf_bufs wakeups to be wakeup_one(), because we don't want to wakeup all of the sleeping threads when we free only one buffer. This avoids us having to needlessly try again (and fail, and go back to sleep) for all the threads sleeping. We will now only wakeup the thread we know will succeed. Reviewed by: green
# 0eecc427	04-Nov-2000	Bosko Milekic <bmilekic@FreeBSD.org>	Setup and put to use the mutex lock for sf_freelist, the sendfile(2) bufs freelist. Should now be thread-friendly, in part. Note: More work is needed in uipc_syscalls.c, but it will have to wait until the socket locking issues are at least 80% implemented and committed.
# 9ff5ce6b	12-Sep-2000	Boris Popov <bp@FreeBSD.org>	Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT. They will be used by nullfs and other stacked filesystems to support full cache coherency. Reviewed in general by: mckusick, dillon
# a5c4836d	19-Aug-2000	David Malone <dwmalone@FreeBSD.org>	Replace the mbuf external reference counting code with something that should be better. The old code counted references to mbuf clusters by using the offset of the cluster from the start of memory allocated for mbufs and clusters as an index into an array of chars, which did the reference counting. If the external storage was not a cluster then reference counting had to be done by the code using that external storage. NetBSD's system of linked lists of mbufs was considered, but Alfred felt it would have locking issues when the kernel was made more SMP friendly. The system implemented uses a pool of unions to track external storage. The union contains an int for counting the references and a pointer for forming a free list. The reference counts are incremented and decremented atomically and so should be SMP friendly. This system can track reference counts for any sort of external storage. Access to the reference counting stuff is now through macros defined in mbuf.h, so it should be easier to make changes to the system in the future. The possibility of storing the reference count in one of the referencing mbufs was considered, but was rejected 'cos it would often leave extra mbufs allocated. Storing the reference count in the cluster was also considered, but because the external storage may not be a cluster this isn't an option. The size of the pool of reference counters is available in the stats provided by "netstat -m". PR: 19866 Submitted by: Bosko Milekic <bmilekic@dsuper.net> Reviewed by: alfred (glanced at by others on -net)
# 42ebfbf2	02-Jul-2000	Brian Feldman <green@FreeBSD.org>	Modify ktrace's general I/O tracing, ktrgenio(), to use a struct uio * instead of a struct iovec * array and int len. Get rid of stupidly trying to allocate all of the memory and copyin()ing the entire iovec[], and instead just do the proper VOP_WRITE() in ktrwrite() using a copy of the struct uio that the syscall originally used. This solves the DoS which could easily be performed; to work around the DoS, one could also remove "options KTRACE" from the kernel. This is a very strong MFC candidate for 4.1. Found by: art@OpenBSD.org
# 8757e5bb	12-Jun-2000	Alfred Perlstein <alfred@FreeBSD.org>	unstatic getfp() so that other subsystems can use it. make sendfile() use it. Approved by: dg
# e3975643	25-May-2000	Jake Burkholder <jake@FreeBSD.org>	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others
# 740a1973	23-May-2000	Jake Burkholder <jake@FreeBSD.org>	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd
# cb679c38	16-Apr-2000	Jonathan Lemon <jlemon@FreeBSD.org>	Introduce kqueue() and kevent(), a kernel event notification facility.
# f48b807f	11-Dec-1999	Brian Feldman <green@FreeBSD.org>	This is Bosko Milekic's mbuf allocation waiting code. Basically, this means that running out of mbuf space isn't a panic anymore, and code which runs out of network memory will sleep to wait for it. Submitted by: Bosko Milekic <bmilekic@dsuper.net> Reviewed by: green, wollman
# 9b962c56	24-Nov-1999	Poul-Henning Kamp <phk@FreeBSD.org>	General clean-up of socket.h and associated sources to synchronise up with NetBSD and the Single Unix Specification v2. This updates some structures with other, almost equivalent types and effort is under way to get the whole more consistent. Also removes a double definition of INET6 and some other clean-ups. Reviewed by: green, bde, phk Some part obtained from: NetBSD, SUSv2 specification
# 2e3c8fcb	16-Nov-1999	Poul-Henning Kamp <phk@FreeBSD.org>	This is a partial commit of the patch from PR 14914: A lot of the code in sys/kern directly accesses the Q_HEAD and Q_ENTRY structures for list operations. This patch makes all list operations in sys/kern use the queue(3) macros, rather than directly accessing the *Q_{HEAD,ENTRY} structures. This batch of changes compile to the same object files. Reviewed by: phk Submitted by: Jake Burkholder <jake@checker.org> PR: 14914
# 923502ff	29-Oct-1999	Poul-Henning Kamp <phk@FreeBSD.org>	useracc() the prequel: Merge the contents (less some trivial bordering the silly comments) of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts the #defines for the vm_inherit_t and vm_prot_t types next to their typedefs. This paves the road for the commit to follow shortly: change useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE} as argument.
# afce0034	13-Oct-1999	Brian Feldman <green@FreeBSD.org>	Add a missing spl lowering. Submitted by: Ville-Pertti Keinonen <will@iki.fi>
# d1f088da	11-Oct-1999	Peter Wemm <peter@FreeBSD.org>	Trim unused options (or #ifdef for undoc options). Submitted by: phk
# bdf7fdcb	30-Sep-1999	Guido van Rooij <guido@FreeBSD.org>	Plug a potential filedescriptor leak. This will probably almost never be triggered. Reviewed by: David Greenman
# 13ccadd4	19-Sep-1999	Brian Feldman <green@FreeBSD.org>	This is what was "fdfix2.patch," a fix for fd sharing. It's pretty far-reaching in fd-land, so you'll want to consult the code for changes. The biggest change is that now, you don't use fp->f_ops->fo_foo(fp, bar) but instead fo_foo(fp, bar), which increments and decrements the fp refcount upon entry and exit. Two new calls, fhold() and fdrop(), are provided. Each does what it seems like it should, and if fdrop() brings the refcount to zero, the fd is freed as well. Thanks to peter ("to hell with it, it looks ok to me.") for his review. Thanks to msmith for keeping me from putting locks everywhere :) Reviewed by: peter
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# e32c66c5	04-Aug-1999	Brian Feldman <green@FreeBSD.org>	Fix fd race conditions (during shared fd table usage.) Badfileops is now used in f_ops in place of NULL, and modifications to the files are more carefully ordered. f_ops should also be set to &badfileops upon "close" of a file. This does not fix other problems mentioned in this PR than the first one. PR: 11629 Reviewed by: peter
# d254af07	27-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
# ec42cbfc	25-Jan-1999	Bill Fenner <fenner@FreeBSD.org>	Don't free the socket address if soaccept() / pru_accept() doesn't return one.
# 257aefa7	23-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Addendum: The original code that the last commit 'fixed' actually did not have a bug in it, but the last commit did make it more readable so we are keeping it.
# 89600e86	23-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	There was a situation where sendfile() might attempt to initiate I/O on a PG_BUSY page, due to a bug in its sequencing of a conditional.
# 0069f505	21-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Fixed a potential bug ( but maybe not ), where sendfile() clears PG_BUSY on a page without testing for waiters. Also collapsed busy wait into new vm_page_sleep_busy() inline ( see vm/vm_page.h )
# 1c7c3c6a	21-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	This is a rather large commit that encompasses the new swapper, changes to the VM system to support the new swapper, VM bug fixes, several VM optimizations, and some additional revamping of the VM code. The specific bug fixes will be documented with additional forced commits. This commit is somewhat rough in regards to code cleanup issues. Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
# f1d19042	07-Dec-1998	Archie Cobbs <archie@FreeBSD.org>	The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static and local variables, goto labels, and functions declared but not defined.
# 911e8dbc	02-Dec-1998	David Greenman <dg@FreeBSD.org>	Fixed broken code in sendfile(2) when using file offsets.
# 9d2b0909	22-Nov-1998	Don Lewis <truckman@FreeBSD.org>	We can't call fsetown() from sonewconn() because sonewconn() is called from an interrupt context and fsetown() wants to peek at curproc, call malloc(..., M_WAITOK), and fiddle with various unprotected data structures. The fix is to move the code that duplicates the F_SETOWN/FIOSETOWN state of the original socket to the new socket from sonewconn() to accept1(), since accept1() runs in the correct context. Deferring this until the process calls accept() is harmless since the process can't do anything useful with SIGIO on the new socket until it has the descriptor for that socket. One could make the case for not bothering to duplicate the F_SETOWN/FIOSETOWN state and requiring the process to explicitly make the fcntl() or ioctl() call on the new socket, but this would be incompatible with the previous implementation and might break programs which rely on the old semantics. This bug was discovered by Andrew Gallatin <gallatin@cs.duke.edu>.
# 4f699173	18-Nov-1998	David Greenman <dg@FreeBSD.org>	Closed a very narrow and rare race condition that involved net interrupts, bio interrupts, and a truncated file that along with the precise alignment of the planets could result in a page being freed multiple times or a just-freed page being put onto the inactive queue.
# efac52b4	15-Nov-1998	David Greenman <dg@FreeBSD.org>	In sendfile(2), check against sb_lowat when filling the socket buffer, rather than 0.
# f2efb8e4	14-Nov-1998	David Greenman <dg@FreeBSD.org>	Fixed a couple of nits in sendfile(2): clear PG_ZERO before unbusying the page, and use passed-in "p" rather than curproc in uio struct.
# bd81f199	06-Nov-1998	David Greenman <dg@FreeBSD.org>	Added support for non-blocking sockets to sendfile(2).
# dd0b2081	05-Nov-1998	David Greenman <dg@FreeBSD.org>	Implemented zero-copy TCP/IP extensions via sendfile(2) - send a file to a stream socket. sendfile(2) is similar to implementations in HP-UX, Linux, and other systems, but the API is more extensive and addresses many of the complaints that the Apache Group and others have had with those other implementations. Thanks to Marc Slemko of the Apache Group for helping me work out the best API for this. Anyway, this has the "net" result of speeding up sends of files over TCP/IP sockets by about 10X (that is to say, uses 1/10th of the CPU cycles) when compared to a traditional read/write loop.
# cfe8b629	22-Aug-1998	Garrett Wollman <wollman@FreeBSD.org>	Yow! Completely change the way socket options are handled, eliminating another specialized mbuf type in the process. Also clean up some of the cruft surrounding IPFW, multicast routing, RSVP, and other ill-explored corners.
# 2b605d08	10-Jun-1998	Doug Rabson <dfr@FreeBSD.org>	64bit fixes: don't cast p->p_retval to an int*.
# 115facb2	14-Apr-1998	Poul-Henning Kamp <phk@FreeBSD.org>	Fix a minor mbuf leak created by the previous change. Reviewed by: phk Submitted by: pb@fasterix.freenix.org (Pierre Beyssac)
# aba55893	11-Apr-1998	Poul-Henning Kamp <phk@FreeBSD.org>	setsockopt() transports user option data in an mbuf. if the user data is greater than MLEN, setsockopt is unable to pass it onto the protocol handler. Allocate a cluster in such case. PR: 2575 Reviewed by: phk Submitted by: Julian Assange proff@iq.org
# 08637435	28-Mar-1998	Bruce Evans <bde@FreeBSD.org>	Moved some #includes from <sys/param.h> nearer to where they are actually used.
# 303b270b	08-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Staticize.
# 5591b823d	16-Dec-1997	Eivind Eklund <eivind@FreeBSD.org>	Make COMPAT_43 and COMPAT_SUNOS new-style options.
# 0bec68bf	14-Dec-1997	Mike Smith <msmith@FreeBSD.org>	Consult sa_len before trampling it with MSG_COMPAT set. PR: kern/5291 Submitted by: pb@fasterix.freenix.org (Pierre Beyssac)
# 5af7db2b	13-Dec-1997	Mike Smith <msmith@FreeBSD.org>	As described by the submitter: ... fix a bug with orecvfrom() or recvfrom() called with the MSG_COMPAT flag on kernels compiled with the COMPAT_43 option. The symptom is that the fromaddr is not correctly returned. This affects the Linux emulator. Submitted by: pb@fasterix.freenix.org (Pierre Beyssac)
# cb226aaa	06-Nov-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Move the "retval" (3rd) parameter from all syscall functions and put it in struct proc instead. This fixes a boatload of compiler warnings, and removes a lot of cruft from the sources. I have not removed the /ARGSUSED/, they will require some looking at. libkvm, ps and other userland struct proc frobbing programs will need recompiled.
# a1c995b6	12-Oct-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Last major round (Unless Bruce thinks of something :-) of malloc changes. Distribute all but the most fundamental malloc types. This time I also remembered the trick to making things static: Put "static" in front of them. A couple of finer points by: bde
# e4ba6a82	02-Sep-1997	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# fa5cde12	17-Aug-1997	Garrett Wollman <wollman@FreeBSD.org>	Delete a bit of debugging code that mistakenly crept in, and as a consequence revert rev. 1.28's header file additions which are no longer needed.
# 19c0663e	17-Aug-1997	Tor Egge <tegge@FreeBSD.org>	Use KERNBASE, not 0xf0000000.
# 57bf258e	16-Aug-1997	Garrett Wollman <wollman@FreeBSD.org>	Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to now malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
# a29f300e	27-Apr-1997	Garrett Wollman <wollman@FreeBSD.org>	The long-awaited mega-massive-network-code- cleanup. Part I. This commit includes the following changes: 1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility glue for them is deleted, and the kernel will panic on boot if any are compiled in. 2) Certain protocol entry points are modified to take a process structure, so that they can easily tell whether or not it is possible to sleep, and also to access credentials. 3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt() call. Protocols should use the process pointer they are now passed. 4) The PF_LOCAL and PF_ROUTE families have been updated to use the new style, as has the `raw' skeleton family. 5) PF_LOCAL sockets now obey the process's umask when creating a socket in the filesystem. As a result, LINT is now broken. I'm hoping that some enterprising hacker with a bit more time will either make the broken bits work (should be easy for netipx) or dike them out.
# 9dd8309d	09-Apr-1997	Bruce Evans <bde@FreeBSD.org>	Removed support for OLD_PIPE. <sys/stat.h> is now missing the hack that supported nameless pipes being indistinguishable from fifos. We're not going back.
# a91b8721	30-Mar-1997	David Greenman <dg@FreeBSD.org>	In accept1(), falloc() is called after the process has awoken, but prior to removing the connection from the queue. The problem here is that falloc() may block and this would allow another process to accept the connection instead. If this happens to leave the queue empty, then the system will panic with an "accept: nothing queued". Also changed a wakeup() to a wakeup_one() to avoid the "thundering herd" problem on new connections in Apache (or any other application that has multiple processes blocked in accept() for the same socket).
# 3ac4d1ef	22-Mar-1997	Bruce Evans <bde@FreeBSD.org>	Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined. Fixed everything that depended on getting fcntl.h stuff from the wrong place. Most things don't depend on file.h stuff at all.
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 67f7ea2d	15-Oct-1996	Garrett Wollman <wollman@FreeBSD.org>	Preserve file flags in accept(2). Submitted by: fredriks@mcs.com in PR#1775 (this implementation is different)
# b12e5e82	23-Aug-1996	Peter Wemm <peter@FreeBSD.org>	The socketpair(2) syscall is bogusly returning the fd numbers through the primary and secondary return codes, causing it to not behave as documented. This probably originates from the ancient BSD kernels that had pipe(2) implemented by socketpair(2), there are no binaries left that we can run that do this. Pointed out by: Robert Withrow <witr@rwwa.com>, PR#731
# 2c37256e	11-Jul-1996	Garrett Wollman <wollman@FreeBSD.org>	Modify the kernel to use the new pr_usrreqs interface rather than the old pr_usrreq mechanism which was poorly designed and error-prone. This commit renames pr_usrreq to pr_ousrreq so that old code which depended on it would break in an obvious manner. This commit also implements the new interface for TCP, although the old function is left as an example (#ifdef'ed out). This commit ALSO fixes a longstanding bug in the TCP timer processing (introduced by davidg on 1995/04/12) which caused timer processing on a TCB to always stop after a single timer had expired (because it misinterpreted the return value from tcp_usrreq() to indicate that the TCB had been deleted). Finally, some code related to polling has been deleted from if.c because it is not relevant to -current and doesn't look at all like my current code.
# 82dab6ce	09-May-1996	Garrett Wollman <wollman@FreeBSD.org>	Make it possible to return more than one piece of control information (PR #1178). Define a new SO_TIMESTAMP socket option for datagram sockets to return packet-arrival timestamps as control information (PR #1179). Submitted by: Louis Mamakos <loiue@TransSys.com>
# edbfedac	11-Mar-1996	Peter Wemm <peter@FreeBSD.org>	Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all files are off the vendor branch, so this should not change anything. A "U" marker generally means that the file was not changed in between the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally means that there was a change. [note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]
# be24e9e8	11-Mar-1996	David Greenman <dg@FreeBSD.org>	Changed socket code to use 4.4BSD queue macros. This includes removing the obsolete soqinsque and soqremque functions as well as collapsing so_q0len and so_qlen into a single queue length of unaccepted connections. Now the queue of unaccepted & complete connections is checked directly for queued sockets. The new code should be functionally equivalent to the old while being substantially faster - especially in cases where large numbers of connections are often queued for accept (e.g. http).
# 09bb5f75	24-Feb-1996	Poul-Henning Kamp <phk@FreeBSD.org>	Make getsockopt() capable of handling more than one mbuf worth of data. Use this to read rules out of ipfw. Add the lkm code to ipfw.c
# dc915e7c	13-Feb-1996	Garrett Wollman <wollman@FreeBSD.org>	Kill XNS. While we're at it, fix socreate() to take a process argument. (This was supposed to get committed days ago...)
# f9827213	28-Jan-1996	John Dyson <dyson@FreeBSD.org>	Enable the new fast pipe code. The old pipes can be used with the "OLD_PIPE" config option.
# db6a20e2	03-Jan-1996	Garrett Wollman <wollman@FreeBSD.org>	Converted two options over to the new scheme: USER_LDT and KTRACE.
# 6ee78bf0	01-Jan-1996	Peter Wemm <peter@FreeBSD.org>	Make pipe() return a set of bidirectional pipe fd's rather than one-way only just like on SVR4. This has no effect on any current programs in our source, but makes the use of SVR4 code a little easier. There is no code or implementation cost in the kernel.. This two-line change merely sets the modes on the ends of the pipes to be bidirectional. There are no other changes.
# 47daf5d5	14-Dec-1995	Bruce Evans <bde@FreeBSD.org>	Nuked ambiguous sleep message strings: old: new: netcls[] = "netcls" "soclos" netcon[] = "netcon" "accept", "connec" netio[] = "netio" "sblock", "sbwait"
# 5fdb8324	23-Oct-1995	Bruce Evans <bde@FreeBSD.org>	Simplify the pseudo-argument removal changes by not optimizing for the !COMPAT_43 case - use a common function even when there is no `old' function. The diffs for this are large because of code motion to restore the function order to what it was before the pseudo-argument changes. Include <sys/sysproto.h> to get correct args structs and prototypes. The diffs for this are large because the declarations of the args structs were moved to become comments in the function headers. The comments may actually match the automatically generated declarations right now. Add prototypes.
# 88c94611	11-Oct-1995	Steven Wallace <swallace@FreeBSD.org>	Remove the '1' from getpeername1 and getsockname1 when NOT COMPAT_OLDSOCK. Left it in there by mistake.
# 93c9414e	07-Oct-1995	Steven Wallace <swallace@FreeBSD.org>	Remove compat_43 pseudo-argument hack, and replace with a better hack. Instead of using a fake "compat" argument, pass a real compat int to function if COMPAT_43 is defined. Functions involved: wait4, accept, recvfrom, getsockname. With the compat pseudo-argument, this introduces an argument structure that can have two possible sizes depending on compat options. This makes life difficult for lkm modules like ibcs2, which would have to guess what size used in kernel when compiled. Also, the prototype generator for these structures cannot generate proper sizes. Now there is only one fixed structure and makes everybody happy. I recommend these changes be introduced to 2.1 so that ibcs2, linux lkm's generated for 2.2 can still run on a 2.1 kernel.
# 9b2e5354	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
# b5e8ce9f	16-Mar-1995	Bruce Evans <bde@FreeBSD.org>	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# 797f2d22	02-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	All of this is cosmetic. prototypes, #includes, printfs and so on. Makes GCC a lot more silent.
# 3c4dd356	02-Aug-1994	David Greenman <dg@FreeBSD.org>	Added $Id$
# 26f9a767	25-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources