History log of /freebsd-current/sys/kern/syscalls.master
Revision Date Author Comments
# 6b7e4254 21-May-2024 Edward Tomasz Napierala <trasz@FreeBSD.org>

capsicum: allow rfork(2) in capability mode

Reviewed by: brooks, rwatson
MFC after: 4 days
Differential Revision: https://reviews.freebsd.org/D45040


# 050555e1 13-May-2024 Edward Tomasz Napierala <trasz@FreeBSD.org>

syscalls.master: allow vfork(2) in capsicum(4) capability mode

There is no reason not to do this; we already allow fork(2),
and I need vfork(2) for CHERI process colocation.

Reviewed by: brooks, emaste, oshogbo
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39829


# 78101d43 24-Apr-2024 Brooks Davis <brooks@FreeBSD.org>

syscalls.master: correct return type of {read,write}v

This was missed when read/write, etc were updated to return ssize_t.

Fixes: 2e83b2816183 Fix a few syscall arguments to use size_t instead of u_int.

Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D44930


# 27676ae3 19-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

syscalls.master: use __acl_type_t

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D44418


# d0efabdf 19-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

syscalls.master: make __sys_fcntl take an intptr_t

The (optional) third argument of fcntl is sometimes a pointer so change
the type to intptr_t. Update the libc-internal definition (actually used
by libthr) to take a fixed intptr_t argument rather than pretending it's
a variadic function. (That worked because all supported architectures
pass variadic arguments as though the function was declared with those
types. In CheriBSD that changes because variadic arguments are passed
via a bounded array.)
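
A rough sketch of the idea, not the actual libc or libthr code: a variadic
front end can extract the optional third argument once, with its real type,
and forward it to a fixed-signature helper as an intptr_t. All names here
(model_fcntl, model_sys_fcntl, MODEL_CMD_INT, MODEL_CMD_PTR) are hypothetical
and exist only for illustration.

```c
#include <stdarg.h>
#include <stdint.h>

/* Hypothetical command numbers, not real fcntl(2) commands. */
enum { MODEL_CMD_INT = 1, MODEL_CMD_PTR = 2 };

/* Fixed-signature helper: the third argument is always an intptr_t. */
static int
model_sys_fcntl(int fd, int cmd, intptr_t arg)
{
	(void)fd;
	if (cmd == MODEL_CMD_PTR)
		return (*(const int *)arg);	/* arg carries a pointer */
	return ((int)arg);			/* arg carries an integer */
}

/* Variadic front end: read the optional argument with its real type. */
static int
model_fcntl(int fd, int cmd, ...)
{
	va_list ap;
	intptr_t arg;

	va_start(ap, cmd);
	if (cmd == MODEL_CMD_PTR)
		arg = (intptr_t)va_arg(ap, void *);
	else
		arg = va_arg(ap, int);
	va_end(ap);
	return (model_sys_fcntl(fd, cmd, arg));
}
```

Because the helper has a fixed prototype, it does not depend on how a given
ABI passes variadic arguments; only the front end touches va_list.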

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D44381


# cab73e53 18-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

syscalls.master: struct siginfo -> struct __siginfo

struct siginfo doesn't exist, it's struct __siginfo (and siginfo_t).

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D44380


# 7936d4e4 19-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

syscalls.master: align with sigfastblock declaration

sigfastblock is declared to take a void * argument in the manpage and
headers so declare it that way and use SAL annotations to say it
interacts with a 32-bit word.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D44379


# d8d4ed26 19-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

syscall.master: fix aio_suspend signature

It takes a `const struct iovec *iovp`.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D44378


# 128443a9 19-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

syscalls.master: fix readv and writev iovp decl

Both take const struct iovec * and only read the values.

Reviewed by: olce, kib
Differential Revision: https://reviews.freebsd.org/D44377


# d6d4183c 01-Feb-2024 Brooks Davis <brooks@FreeBSD.org>

syscalls.master: Remove stray blank lines

No functional change.


# d8decc9a 19-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

Add kcmp(2) kernel bits

This is based purely on reading the Linux kcmp(2) man page.
In addition to the Linux set of comparators, I also added KCMP_FILEOBJ to
compare underlying file objects.

Tested by: manu
Reviewed by: brooks, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43518


# 7893419d 04-Dec-2023 Brooks Davis <brooks@FreeBSD.org>

Remove never implemented sbrk and sstk syscalls

Both system calls were stubs returning EOPNOTSUPP and libc did not
provide _ or __sys_ prefixed symbols. The actual implementation of
sbrk(2) is on top of the undocumented break(2) system call.

Technically this is a change in ABI, but no non-contrived program ever
called these syscalls.

Reviewed by: kib, emaste
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D42872


# 5b31cc94 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sccs: Manual changes

For the uncommon items: Go through the tree and remove sccs tags that
didn't fit any nice pattern. If in the neighborhood, other SCM tags were
removed when they were detritus of long-ago CVS somehow in the early
mists of the project. Some adjacent copyright strings were removed (they
duplicated the copyright notices in the file). This also removed
non-standard formations of omission of SCCS tags (usually by adding an
extra #if 0 somewhere).

After this commit, a number of strings tagged with the 'what' @(#)
prefix remain, but they are primarily copyright notices.

Sponsored by: Netflix


# 84d12f88 06-Oct-2023 Kristof Provost <kp@FreeBSD.org>

Add a COMPAT_FREEBSD14 kernel option

Use it wherever COMPAT_FREEBSD13 is currently specified.

Reviewed by: brooks, zlei
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D42100


# 5e29272b 24-Sep-2023 Haoyu Gu <guhaoyu2005@gmail.com>

syscalls.master: Fix SAL annotation for getdirentries basep argument

The last argument of getdirentries, "off_t *basep", is an optional output
argument. It returns the value only when the passed-in value (pointer)
is non-NULL.

This is a part of the research work at RCSLab, University of Waterloo.

Reviewed by: imp, emaste
Differential Revision: https://reviews.freebsd.org/D41969


# af93fea7 23-Aug-2023 Jake Freeland <jfree@freebsd.org>

timerfd: Move implementation from linux compat to sys/kern

Move the timerfd implementation from linux compat code to sys/kern. Use
it to implement the new system calls for timerfd. Add a hook to kern_tc
to allow timerfd to know when the system time has stepped. Add kqueue
support to timerfd. Adjust a few names to be less Linux centric.

RelNotes: YES
Reviewed by: markj (on irc), imp, kib (with reservations), jhb (slack)
Differential Revision: https://reviews.freebsd.org/D38459


# 4a69fc16 07-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

Add membarrier(2)

This is an attempt at a clean-room implementation of Linux's
membarrier(2) syscall. For documentation, you would need to read
the Linux membarrier(2) man page and the comments in the Linux
kernel/sched/membarrier.c implementation, and possibly look at
actual uses.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32360


# 78d14616 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line bare tag

Remove /^\s*\$FreeBSD\$$\n/


# 602b575a 20-Apr-2023 Warner Losh <imp@FreeBSD.org>

syscall.master: Remove stray 4.2

Back in 4.3BSD, the system call table wasn't generated, and there was an
entry:
    "4.2 sigreturn", /* 139 = old 4.2 sigreturn */
This got converted to
    139 OBSOL 0 4.2 sigreturn
in 4.3 RENO. Since it was obsolete, nothing bad happened. In fact,
there was code in makesyscalls.sh to cope:
    { comment = $4
      for (i = 5; i <= NF; i++)
          comment = comment " " $i
      if (NF < 5)
          $5 = $4
    }
so the generated comment in syscalls.c was almost correct:
    "obs_4.2", /* 139 = obsolete 4.2 sigreturn */
a bug that we have to this very day, despite makesyscalls.sh being
rewritten in lua.

However, this historical wart is the only place in our current
syscalls.master file where we have an extra field for the 'not
generated' class of system calls. Remove the historical wart so that the
re-write of makesyscalls.lua can be simpler (so, I hope, qemu's bsd-user
can have large swathes of code automatically generated too). This should help
make things more understandable (changes to simplify makesyscalls.lua
aren't quite debugged, so have to wait for another day).

There are 3 different obsolete sigreturns (but only 1 that was ever in
FreeBSD 2.x and newer).

Sponsored by: Netflix


# 559b94a1 20-Apr-2023 Warner Losh <imp@FreeBSD.org>

syscall.master: Fix comments

Have more accurate comments. While #if, #else, etc are copied to the
header files, lines that don't start with # are not. And #include files
are only output to sysinc (which winds up at the front of init_sysent.c
which seems a bit odd). This is all radically undocumented, and likely
has drifted somewhat from 4.4BSD and what other systems do (they've
drifted too, fwiw).

Sponsored by: Netflix


# dac31024 31-Mar-2023 Konstantin Belousov <kib@FreeBSD.org>

Rename kqueue1(2) to kqueuex(2) to avoid compat issues with NetBSD

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39377


# 61194e98 25-Mar-2023 Konstantin Belousov <kib@FreeBSD.org>

Add kqueue1() syscall

It takes the flags argument. Immediate use is to provide the KQUEUE_CLOEXEC
flag for kqueue(2).

Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39271


# 52a1d90c 13-Apr-2022 Ed Maste <emaste@FreeBSD.org>

Allow posix_fadvise in capability mode

posix_fadvise operates only on a provided fd. Noted by
Mathieu <sigsys@gmail.com> in review D34761.

No new CAP_ rights are added for posix_fadvise(), as 'advice' in
general only influences when I/O happens; the fd must have existing
CAP_ rights for actual data access.

Reviewed by: markj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34903


# e5821a21 30-Mar-2022 Ed Maste <emaste@FreeBSD.org>

syscalls.master: remove obsolete comment about compatibility tables

Compatibility ABIs no longer use a separate syscalls.master.

Fixes: be67ea40c5a0 ("freebsd32: generate from ...")
Sponsored by: The FreeBSD Foundation


# 53465702 08-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

swapoff: add one more variant of the syscall

Requested and reviewed by: brooks
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33343


# c1a84727 08-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

syscalls: add COMPAT13

Reviewed by: brooks
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33343


# 6d37a167 29-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: mprotect does not take a const

The mprotect syscall declaration is not const. I added this one
incorrectly in a944d28d0edf7ceb1bef4d789dfa4e8e18331658.

Reported by: kib
Reviewed by: kib, imp


# a8efd4d1 29-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: make syscall and __syscall SYSMUX

Rather than combining the declaration of nosys with the registration
of SYS_syscall, declare syscall(2) and __syscall(2) with the new
SYSMUX type in syscalls.master and declare nosys directly. This
eliminates the last use of syscall aliases in the tree.

Reviewed by: kib, imp


# d7f306c5 29-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

makesyscalls: add a new SYSMUX type

This type is for system call multiplexers (syscall(2), __syscall(2))
that don't have a normal handler and instead are handled in the
machine-dependent syscall code.

Reviewed by: kib, imp


# cffb55f0 29-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: normalize exit

Declare the exit system call normally. This results in the
implementation being named sys_exit rather than sys_sys_exit and
being declared as returning an int. In fact it does not return
at all because exit1 does not, so add an __unreachable() to let the
compiler know that.

Reviewed by: kib, imp


# 638c5fa8 29-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: normalize (get|set)rlimit

Declare normal <foo>_args structs rather than going out of the way
to declare __<foo>_args.

Reviewed by: kib, imp


# ba4e5253 29-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: normalize orecvfrom and ogetsockname

Declare o<foo>_args rather than reusing the equivalent <foo>_args
structs. Avoiding the addition of a new type isn't worth the
gratuitous differences.

Reviewed by: kib, imp


# 3660e76a 29-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: correct a couple style issues

Reviewed by: kib, imp


# 33f9ea20 29-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: add missing SAL annotations

freebsd7_shmctl was missing an annotation

Reviewed by: kib, imp


# be67ea40 22-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: generate from sys/kern/syscalls.master

This avoids the need to keep a freebsd32-specific syscalls.master
in sync with the default ABI. As evidenced by the number of commits
required to sync the two, it is extremely easy for them to get out
of sync due to misunderstandings and user errors.

Reviewed by: kevans, kib


# 799ce8b8 22-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: annotate args pointing to long, pointer, or time_t

Add _Contains_ annotations indicating that the data pointed to by a
pointer argument contains types that vary between FreeBSD ABIs. The
supported set is long (including size_t), pointer (including
intptr_t), and time_t. The first two vary between 32- and 64-bit
ABIs. The last varies between i386 and everything else.

These will be used to detect which syscalls require handling on
particular ABIs.

Reviewed by: kevans, kib


# 00e0a4c0 22-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: abort2 doesn't return so declare as void

Reviewed by: kib


# 4b2e1f14 22-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: umask returns a mode_t

Reviewed by: kib


# 27f5b514 22-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: update a few return types to ssize_t

Reviewed by: kib


# 717e7fb2 22-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: struct ucontext4 -> struct freebsd4_ucontext

This aligns with struct freebsd4_ucontext32 in freebsd32.

Reviewed by: kib


# d8bd949b 22-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

sys___sysctl: regularize argument struct

Let makesyscalls generate the normal struct __sysctl_args structure.
It works fine.

Reviewed by: kib


# 88dfcfa2 22-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

sys_sigaltstack: use struct sigaltstack arg

This is identical to stack_t and more amenable to appending "32"
for freebsd32.

Reviewed by: kib


# 85d1d2a6 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: use struct siginfo rather than siginfo_t

This allows freebsd32 to use struct siginfo32 with an automatable
conversion.

Reviewed by: kevans


# f5032882 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: fix type of osendmsg

osendmsg takes a struct omsghdr * not a void *.

Reviewed by: kevans


# 2385f4d1 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: use __socklen_t as appropriate

No functional change as __socklen_t is an int.

Obtained from: CheriBSD

Reviewed by: kevans


# b64f3dc2 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: [gs]etitimer takes an int which

Match the function declaration which takes an int not an unsigned int.
No functional change as the range of valid values is 0-2.

Obtained from: CheriBSD

Reviewed by: kevans


# b7fd8611 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: sprinkle in const values

Add missing const qualifiers to a number of syscall arguments.

Obtained from: CheriBSD

Reviewed by: kevans


# 8e4a3add 15-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

struct kevent_freebsd11 -> struct freebsd11_kevent

Rename to match the naming of syscalls and allow 32 to be appended
without making an ugly name like kevent_freebsd1132.

While here, make the kevent changelist argument const.

Reviewed by: kib


# f0da2a14 15-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: unwrap a long line

Style dictates that each variable is on a single line

Reviewed by: kib


# 77b2c2f8 22-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

Add sched_getcpu()

for compatibility with Linux.

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32901


# 6bc90e8a 01-Sep-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls.master: correct formatting issues

Reviewed by: kevans, emaste
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D31351


# df501bac 01-Sep-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls.master: switch to CAPENABLED flags

Switch the main syscall table to use CAPENABLED flags rather than
capabilities.conf. This avoids synchronization issues between
syscalls.master and capabilities.conf (e.g. when renaming a syscall
during development).

For now, move capabilities.conf to sys/compat/freebsd32 and use it
there. Use of sys/compat/freebsd32/syscalls.master should be replaced
by makesyscalls.lua enhancements to allow the main one to be used.

This change results in no changes to generated files after running
`make sysent`.

Reviewed by: kevans, emaste
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D31350


# 6945df3f 01-Sep-2021 Brooks Davis <brooks@FreeBSD.org>

makesyscalls.lua: add a CAPENABLED flag

The CAPENABLED flag indicates that the syscall can be used in capsicum
capability mode. It is intended to replace capabilities.conf.

Reviewed by: kevans, emaste
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D31349


# 0dc332bf 05-Aug-2021 Ka Ho Ng <khng@FreeBSD.org>

Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).

fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI for kmods' use.

The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.

fspacectl(2) takes cmd and flags parameters to specify the
space management operation to be performed. Currently cmd has to be
SPACECTL_DEALLOC, and flags has to be 0.

fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28347


# 9b6b793b 19-Jul-2021 Konstantin Belousov <kib@FreeBSD.org>

Revert most of ce42e793100b460f597e4c85ec0da12e274f9394

to restore ABI compatibility for pre-10.x binaries.

It restores _umtx_lock() and _umtx_unlock() syscalls, and UMTX_OP_LOCK/
UMTX_OP_UNLOCK umtx_op(2) operations. UMUTEX_ERROR_CHECK flag is left
out for now, I do not think it makes a difference.

PR: 218571
Reviewed by: brooks (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31220


# 4bc2174a 09-Jul-2019 Moritz Buhl <gh@moritzbuhl.de>

kern: fail getgroup and setgroup with negative int

Found using
https://github.com/NetBSD/src/blob/trunk/tests/lib/libc/sys/t_getgroups.c

getgroups/setgroups want an int and therefore casting it to u_int
resulted in `getgroups(-1, ...)` not returning -1 / errno = EINVAL.
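
The effect of the cast can be illustrated with a small model. This is a
sketch only, not the kernel code: the function names are hypothetical, ngrp
stands for the number of groups the process has, and the real syscall also
handles a size of 0 and copies the groups out, which is omitted here.

```c
#include <errno.h>

/* Old behaviour: the size is treated as unsigned, so -1 becomes huge. */
static int
model_getgroups_old(int gidsetsize, int ngrp)
{
	if ((unsigned int)gidsetsize < (unsigned int)ngrp)
		return (EINVAL);
	return (0);
}

/* New behaviour: the size stays a signed int, so negatives are rejected. */
static int
model_getgroups_new(int gidsetsize, int ngrp)
{
	if (gidsetsize < ngrp)
		return (EINVAL);
	return (0);
}
```

In this model, a size of -1 passes the old check because it compares as a
very large unsigned value, while the signed comparison rejects it.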

imp@ updated syscall.master and made changes markj@ suggested

PR: 189941
Tested by: imp@
Reviewed by: markj@
Pull Request: https://github.com/freebsd/freebsd-src/pull/407
Differential Revision: https://reviews.freebsd.org/D30617


# d89c1c46 26-Jan-2021 Brooks Davis <brooks@FreeBSD.org>

Reserve gaps in syscall numbers for local use

It is best for auditing of syscalls.master if we only append to the
file. Reserving unimplemented system call numbers for local use makes
this policy and provides a large set of syscall numbers FreeBSD
derivatives can use without risk of conflict.

Reviewed by: jhb, kevans, kib
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27988


# 119fa6ee 26-Jan-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls.master: Add a new syscall type: RESERVED

RESERVED syscall numbers are reserved for local/vendor use. RESERVED is
identical to UNIMPL except that comments are ignored.

Reviewed by: kevans
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27988


# 65a524b4 26-Jan-2021 Brooks Davis <brooks@FreeBSD.org>

Remove documentation of unimplemented syscalls

We have not been able to run binaries from other BSDs for well over a
decade. There is no need to document their allocation decisions here.

We also don't need to reserve syscall numbers of never-implemented
syscalls.

Reviewed by: jhb, kib
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27988


# b3286afa 05-Jan-2021 Alan Somers <asomers@FreeBSD.org>

Reallocate syscall numbers for aio_writev and aio_readv

The originally chosen numbers interfere with downstream projects'
syscalls. Move them to the end of the syscall table instead.

Reported by: jrtc27
Reviewed by: brooks
MFC-With: 022ca2fc7fe08d51f33a1d23a9be49e6d132914e
Differential Revision: 022ca2fc7fe08d51f33a1d23a9be49e6d132914e


# 022ca2fc 02-Jan-2021 Alan Somers <asomers@FreeBSD.org>

Add aio_writev and aio_readv

POSIX AIO is great, but it lacks vectored I/O functions. This commit
fixes that shortcoming by adding aio_writev and aio_readv. They aren't
part of the standard, but they're an obvious extension. They work just
like their synchronous equivalents pwritev and preadv.

It isn't yet possible to use vectored aiocbs with lio_listio, but that
could be added in the future.

Reviewed by: jhb, kib, bcr
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D27743


# 7a202823 23-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

Expose eventfd in the native API/ABI using a new __specialfd syscall

eventfd is a Linux system call that produces special file descriptors
for event notification. When porting Linux software, it is currently
usually emulated by epoll-shim on top of kqueues. Unfortunately, kqueues
are not passable between processes. And, as noted by the author of
epoll-shim, even if they were, the library state would also have to be
passed somehow. This came up when debugging strange HW video decode
failures in Firefox. A native implementation would avoid these problems
and help with porting Linux software.

Since we now already have an eventfd implementation in the kernel (for
the Linuxulator), it's pretty easy to expose it natively, which is what
this patch does.

Submitted by: greg@unrelenting.technology
Reviewed by: markj (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26668


# d9021e38 28-May-2020 Rick Macklem <rmacklem@FreeBSD.org>

Add a syscall for the nfs-over-tls daemons to use.

The nfs-over-tls daemons need a system call to perform operations such as
associate a file descriptor with a krpc socket.
The daemons will not be in head for some time, but it will make it
easier for testers of nfs-over-tls to do testing if the system call
is in head (basically the stub for libc which will be committed soon).

Reviewed by: brooks
Differential Revision: https://reviews.freebsd.org/D24949


# 3e6b8291 23-Apr-2020 Kyle Evans <kevans@FreeBSD.org>

close_range(2): use newly assigned AUE_CLOSERANGE


# 7d03e081 14-Apr-2020 Kyle Evans <kevans@FreeBSD.org>

Mark closefrom(2) COMPAT12, reimplement in libc to wrap close_range

Include a temporary compatibility shim as well for kernels predating
close_range, since closefrom is used in some critical areas.

Reviewed by: markj (previous version), kib
Differential Revision: https://reviews.freebsd.org/D24399


# 472ced39 12-Apr-2020 Kyle Evans <kevans@FreeBSD.org>

Implement a close_range(2) syscall

close_range(min, max, flags) allows for a range of descriptors to be
closed. The Python folk have indicated that they would much prefer this
interface to closefrom(2), as the case may be that they/someone have special
fds dup'd to higher in the range and they can't necessarily closefrom(min)
because they don't want to hit the upper range, but relocating them to lower
isn't necessarily feasible.

sys_closefrom has been rewritten to use kern_close_range() using ~0U to
indicate closing to the end of the range. This was chosen rather than
requiring callers of kern_close_range() to hold FILEDESC_SLOCK across the
call to kern_close_range for simplicity.

The flags argument of close_range(2) is currently unused, so setting any flag
currently fails with EINVAL. It was added to the interface in Linux so that future
flags could be added for, e.g., "halt on first error" and things of this
nature.
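
The semantics described above can be sketched with a toy descriptor table.
This is a model under stated assumptions, not the kernel implementation:
the table, its size, and the model_* names are hypothetical, and rejecting
max < min with EINVAL is an assumption of the sketch rather than something
this commit message states.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_NFILES 16	/* hypothetical table size */

/* Toy descriptor table: true means the slot is open. */
static bool model_open[MODEL_NFILES];

static int
model_close_range(unsigned int lowfd, unsigned int highfd, int flags)
{
	unsigned int fd;

	if (flags != 0)		/* no flags are defined yet */
		return (EINVAL);
	if (highfd < lowfd)	/* assumption of this sketch */
		return (EINVAL);
	/* Clamp so that ~0U means "to the end of the table". */
	if (highfd >= MODEL_NFILES)
		highfd = MODEL_NFILES - 1;
	for (fd = lowfd; fd <= highfd; fd++)
		model_open[fd] = false;
	return (0);
}

/* closefrom(lowfd) expressed in terms of the range operation. */
static int
model_closefrom(unsigned int lowfd)
{
	return (model_close_range(lowfd, ~0U, 0));
}
```

A caller that keeps special descriptors high in the range can close only
the span below them, which closefrom alone cannot express.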

This patch is based on a syscall of the same design that is expected to be
merged into Linux.

Reviewed by: kib, markj, vangyzen (all slightly earlier revisions)
Differential Revision: https://reviews.freebsd.org/D21627


# 0573d0a9 20-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add realpathat syscall

realpath(3) is used a lot e.g., by clang and is a major source of getcwd
and fstatat calls. This can be done more efficiently in the kernel.

This works by performing a regular lookup while saving the name and found
parent directory. If the terminal vnode is a directory we can resolve it using
usual means. Otherwise we can use the name saved by lookup and resolve the
parent.

See the review for sample syscall counts.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23574


# 146fc63f 09-Feb-2020 Konstantin Belousov <kib@FreeBSD.org>

Add a way to manage thread signal mask using shared word, instead of syscall.

A new syscall sigfastblock(2) is added which registers a uint32_t
variable as containing the count of blocks for signal delivery. Its
content is read by kernel on each syscall entry and on AST processing,
non-zero count of blocks is interpreted same as the signal mask
blocking all signals.
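
As a loose illustration of the mechanism (a toy model only; the real
interface registers the word with the kernel and its exact layout is not
described here), blocking becomes a plain memory update and the check is a
read of the shared word. All model_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Word shared between "userland" and the "kernel" in this toy model. */
static uint32_t model_block_count;

/* Userland side: entering and leaving a blocked section is a memory update. */
static void
model_block(void)
{
	model_block_count++;
}

static void
model_unblock(void)
{
	model_block_count--;
}

/* Kernel side: a non-zero count is treated like a mask blocking all signals. */
static bool
model_signals_blocked(const uint32_t *wordp)
{
	return (*wordp != 0);
}
```

Nested blocks simply stack: signals stay blocked until the count drops back
to zero.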

The biggest downside of the feature that I see is that memory
corruption that affects the registered fast sigblock location, would
cause quite strange application misbehavior. For instance, the process
would be immune to ^C (but killable by SIGKILL).

With consumers (rtld and libthr added), benchmarks do not show a
slow-down of the syscalls in micro-measurements, and macro benchmarks
like buildworld do not demonstrate a difference. Part of the reason is
that buildworld time is dominated by compiler, and clang already links
to libthr. On the other hand, small utilities typically used by shell
scripts have the total number of syscalls cut by half.

The syscall is not exported from the stable libc version namespace on
purpose. It is intended to be used only by our C runtime
implementation internals.

Tested by: pho
Discussed with: cem, emaste, jilles
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D12773


# 2d5603fe 18-Nov-2019 David Bright <dab@FreeBSD.org>

Jail and capability mode for shm_rename; add audit support for shm_rename

Co-mingling two things here:

* Addressing some feedback from Konstantin and Kyle re: jail,
capability mode, and a few other things
* Adding audit support as promised.

The audit support change includes a partial refresh of OpenBSM from
upstream, where the change to add shm_rename has already been
accepted. Matthew doesn't plan to work on refreshing anything else to
support audit for those new event types.

Submitted by: Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by: kib
Relnotes: Yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22083


# 96c914ee 14-Nov-2019 Brooks Davis <brooks@FreeBSD.org>

Tidy syscall declarations.

Pointer arguments should be of the form "<type> *..." and not "<type>* ...".

No functional change.

Reviewed by: kevans
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22373


# f403831e 01-Oct-2019 Ed Maste <emaste@FreeBSD.org>

syscalls.master: remove superfluous ellipsis in comment

A single period is sufficient in this comment, and making this change
lets us find references to varargs syscalls by searching for ...


# 11fd6a60 30-Sep-2019 Kyle Evans <kevans@FreeBSD.org>

syscalls.master: consistency, move ); to newline (no functional change)


# 9afb12ba 26-Sep-2019 David Bright <dab@FreeBSD.org>

Add an shm_rename syscall

Add an atomic shm rename operation, similar in spirit to a file
rename. Atomically unlink an shm from a source path and link it to a
destination path. If an existing shm is linked at the destination
path, unlink it as part of the same atomic operation. The caller needs
the same permissions as shm_unlink to the shm being renamed, and the
same permissions for the shm at the destination which is being
unlinked, if it exists. If those fail, EACCES is returned, as with the
other shm_* syscalls.

truss support is included; audit support will come later.

This commit includes only the implementation; the sysent-generated
bits will come in a follow-on commit.

Submitted by: Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by: jilles (earlier revision)
Reviewed by: brueffer (manpages, earlier revision)
Relnotes: yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21423


# 234879a7 25-Sep-2019 Kyle Evans <kevans@FreeBSD.org>

Mark shm_open(2) as COMPAT12, succeeded by shm_open2

Implementation and regenerated files will follow.


# 20f70576 25-Sep-2019 Kyle Evans <kevans@FreeBSD.org>

Add a shm_open2 syscall to support upcoming memfd_create

shm_open2 allows a little more flexibility than the original shm_open.
shm_open2 doesn't enforce CLOEXEC on its callers, and it has a separate
shmflag argument that can be expanded later. Currently the only shmflag is
to allow file sealing on the returned fd.

shm_open and memfd_create will both be implemented in libc to use this new
syscall.

__FreeBSD_version is bumped to indicate the presence.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21393


# 85c5f3cb 25-Sep-2019 Kyle Evans <kevans@FreeBSD.org>

Add COMPAT12 support to makesyscalls.sh

Reviewed by: kib, imp, brooks (all without syscalls.master edits)
Differential Revision: https://reviews.freebsd.org/D21366


# d05b53e0 02-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

Add sysctlbyname system call

Previously userspace would issue one syscall to resolve the sysctl and then
another one to actually use it. Do it all in one trip.

Fallback is provided in case newer libc happens to be running on an older
kernel.

Submitted by: Pawel Biernacki
Reported by: kib, brooks
Differential Revision: https://reviews.freebsd.org/D17282


# bbbbeca3 24-Jul-2019 Rick Macklem <rmacklem@FreeBSD.org>

Add kernel support for a Linux compatible copy_file_range(2) syscall.

This patch adds support to the kernel for a Linux compatible
copy_file_range(2) syscall and the related VOP_COPY_FILE_RANGE(9).
This syscall/VOP can be used by the NFSv4.2 client to implement the
Copy operation against an NFSv4.2 server to do file copies locally on
the server.
The vn_generic_copy_file_range() function in this patch can be used
by the NFSv4.2 server to implement the Copy operation.
Fuse may also be able to use the VOP_COPY_FILE_RANGE() method.

vn_generic_copy_file_range() attempts to maintain holes in the output
file in the range to be copied, but may fail to do so if the input and
output files are on different file systems with different _PC_MIN_HOLE_SIZE
values.

Separate commits will be done for the generated syscall files and userland
changes. A commit for a compat32 syscall will be done later.

Reviewed by: kib, asomers (plus comments by brooks, jilles)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20584


# e8ee7d90 16-Apr-2019 Ed Maste <emaste@FreeBSD.org>

correct readlinkat(2) return type

r176215 corrected readlink(2)'s return type and the type of the last
argument. readlinkat(2) was introduced in r177788 after being developed
as part of Google Summer of Code 2007; it appears to have inherited the
wrong return type.

Man pages and header files were already ssize_t; update syscalls.master
to match.

PR: 197915
Submitted by: Henning Petersen <henning.petersen@t-online.de>
MFC after: 2 weeks


# a1304030 06-Apr-2019 Mariusz Zaborski <oshogbo@FreeBSD.org>

Introduce funlinkat syscall that allows us to check if we are removing
the file associated with the given file descriptor.

Reviewed by: kib, asomers
Reviewed by: cem, jilles, brooks (they reviewed previous version)
Discussed with: pjd, and many others
Differential Revision: https://reviews.freebsd.org/D14567


# 10f7b12c 17-Dec-2018 Brooks Davis <brooks@FreeBSD.org>

const poison the `new` pointer of __sysctl.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D18444


# d1fd400a 07-Dec-2018 Konstantin Belousov <kib@FreeBSD.org>

Add new file handle system calls.

Namely, getfhat(2), fhlink(2), fhlinkat(2), fhreadlink(2). The
syscalls are provided for a NFS userspace server (nfs-ganesha).

Submitted by: Jack Halford <jack@gandi.net>
Sponsored by: Gandi.net
Tested by: pho
Feedback from: brooks, markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18359


# 41f7b253 04-Dec-2018 Brooks Davis <brooks@FreeBSD.org>

Remove NOARGS from oaccept.

This was in the original patch, but lost in a rebase.

Reported by: andrew
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15816


# d48719bd 04-Dec-2018 Brooks Davis <brooks@FreeBSD.org>

Normalize COMPAT_43 syscall declarations.

Have ogetkerninfo, ogetpagesize, ogethostname, osethostname, and oaccept
declare o<foo>_args structs rather than non-compat ones. Due to a
failure to use NOARGS in most cases this adds only one new declaration.

No changes required in freebsd32 as only ogetpagesize() is implemented
and it has a 32-bit specific implementation.

Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15816


# 9a38df59 09-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Fix freebsd32 mknod(at).

As dev_t is now a 64-bit integer, it requires special handling as a
system call argument. 64-bit arguments are split between two 64-bit
integers due to the way arguments are promoted to allow reuse of most
system call implementations. They must be reassembled before use.
Further, 64-bit arguments at an odd offset (counting from zero) are
padded and slid to the next slot on powerpc and mips. Fix the
non-COMPAT11 system call by adding a freebsd32_mknodat() and
appropriately padded declarations.
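
The reassembly and padding rules described above can be sketched in portable
C. This is an illustration under stated assumptions, not code from the tree:
join_halves() and aligned_slot() are hypothetical names, and which slot holds
the low half depends on the byte order of the 32-bit ABI.

```c
#include <stdint.h>

/*
 * Hypothetical helper: rebuild a 64-bit value from its two 32-bit halves,
 * each of which arrives promoted into its own 64-bit argument slot.
 */
static uint64_t
join_halves(uint64_t lo_slot, uint64_t hi_slot)
{
	return ((uint64_t)(uint32_t)lo_slot |
	    ((uint64_t)(uint32_t)hi_slot << 32));
}

/*
 * Hypothetical helper: on ABIs that require it, a 64-bit argument that would
 * start at an odd slot (counting from zero) starts at the next even slot.
 */
static unsigned
aligned_slot(unsigned slot)
{
	return ((slot + 1u) & ~1u);
}
```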

The COMPAT11 system calls are fully compatible with the 64-bit
implementations so remove the freebsd32_ versions.

Use uint32_t consistently as the type of the old dev_t. This matches
the old definition.

Reviewed by: kib
MFC after: 3 days
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17928


# e56ec0e5 07-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

makesyscalls.sh: allow pointer return types.

The previous code required that the return type be a single word. This
allows it to be a pointer without using a typedef.

Update the return types of break, mmap, and shmat to be void * as
declared. This only affects systrace output in-tree, but can aid in
generating system call wrappers from syscalls.master.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17873


# dd4d2f21 06-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Update some comments made obsolete by recent commits.


# 318f0d77 06-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Use declared types for caddr_t arguments.

Leave ptrace(2) alone for the moment as it's defined to take a caddr_t.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17852


# 44cbc1c2 05-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Fix a couple indentation errors in r339958.


# 12e69f96 02-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Add const to input-only char * arguments.

These arguments are mostly paths handled by NAMEI*() macros which already
take const char * arguments.

This change improves the match between syscalls.master and the public
declarations of system calls.

Reviewed by: kib (prior version)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17812


# 2105ac07 01-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Use mode_t when the documented signature does.

This is more clear and produces better results when generating function
stubs from syscalls.master.

Reviewed by: kib, emaste
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17784


# e3e54813 31-Oct-2018 Brooks Davis <brooks@FreeBSD.org>

Reformat syscalls.master for better readability.

This takes advantage of two recent changes to makesyscalls.sh:
r328598: Permit a range of syscall numbers for UNIMPL
r339624: Remove the need for backslashes in syscalls.master

Syscall declarations are now split across multiple lines with the
syscall name and variables each on separate lines (with an exception for
syscalls taking no arguments.)

Reviewed by: imp
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17706


# 22c0c9a4 22-Oct-2018 Brooks Davis <brooks@FreeBSD.org>

Remove __restrict qualifiers from syscalls.master.

The restrict qualifier is intended to aid code generation in the
compiler, but the only access to storage through these pointers is via
structs using copyin/copyout and the like which can not be written in C
or C++ and thus the compiler gains nothing from the qualifiers.

As such, the qualifiers add no value in current usage.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17574


# 29bf3a7b 15-Oct-2018 Kyle Evans <kevans@FreeBSD.org>

Correct COMPAT* macro names in syscalls.master

Both ^/sys/compat/freebsd32/syscalls.master and ^/sys/kern/syscalls.master
cited "COMPAT[n] #ifdef" instead of "COMPAT_FREEBSD[n] #ifdef" in places.

Approved by: re (glebius)


# 46e20549 28-Sep-2018 John Baldwin <jhb@FreeBSD.org>

Mark various removed system calls as OBSOL instead of UNIMPL.

This is mostly a cosmetic change except that obsolete system calls are
assigned meaningful names in the names arrays which means that using
tools like kdump or truss against binaries invoking these system calls
will print out the name instead of the number. The script I use to
generate the XML list of syscalls for GDB also ignores UNIMPL but not
OBSOL entries. In general UNIMPL should only be used to reserve
placeholders for system calls that have never been implemented while
system calls that existed at one time in FreeBSD but were removed
should be marked OBSOL instead.

Reviewed by: brooks, kib, imp
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17344


# c542c43e 16-Aug-2018 Jamie Gritton <jamie@FreeBSD.org>

Revert r337922, except for some documentation-only bits. This needs to wait
until user is changed to stop using jail(2).

Differential Revision: D14791


# 284001a2 16-Aug-2018 Jamie Gritton <jamie@FreeBSD.org>

Put jail(2) under COMPAT_FREEBSD11. It has been the "old" way of creating
jails since FreeBSD 7.

Along with the system call, put the various security.jail.allow_foo and
security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or
BURN_BRIDGES). These sysctls had two disparate uses: on the system side,
they were global permissions for jails created via jail(2) which lacked
fine-grained permission controls; inside a jail, they're read-only
descriptions of what the current jail is allowed to do. The first use
is obsolete along with jail(2), but keep them for the second, read-only use.

Differential Revision: D14791


# 7cc923f8 10-Jul-2018 Brooks Davis <brooks@FreeBSD.org>

Get rid of netbsd_lchown and netbsd_msync syscall entries.

No valid FreeBSD binary ever called them (they would call lchown and
msync directly) and we haven't supported NetBSD binaries in ages.

This is a respin of r335983 with a workaround for the ancient BFD linker
in the libc stubs.

Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16193


# 714c03c8 05-Jul-2018 Brooks Davis <brooks@FreeBSD.org>

Revert r335983.

The bfd linker in tree doesn't support multiple names for the same
symbol (at least with current flags).


# 5b04a71d 05-Jul-2018 Brooks Davis <brooks@FreeBSD.org>

Get rid of netbsd_lchown and netbsd_msync syscall entries.

No valid FreeBSD binary ever called them (they would call lchown and
msync directly) and we haven't supported NetBSD binaries in ages.

Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15814


# 9da5364e 14-Jun-2018 Brooks Davis <brooks@FreeBSD.org>

Name the implementation of brk and sbrk sys_break().

The break() system call was renamed (several times) starting in v3
AT&T UNIX when C was invented and break was a language keyword. The
last vestage of a need for it to be called something else (eg obreak)
was removed in r225617 which consistantly prefixed all syscall
implementations.

Reviewed by: emaste, kib (older version)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15638


# 9f9c9b22 04-Jun-2018 Mark Johnston <markj@FreeBSD.org>

Reimplement brk() and sbrk() to avoid the use of _end.

Previously, libc.so would initialize its notion of the break address
using _end, a special symbol emitted by the static linker following
the bss section. Compatibility issues between lld and ld.bfd could
cause the wrong definition of _end (libc.so's definition rather than
that of the executable) to be used, breaking the brk()/sbrk()
interface.

Avoid this problem and future interoperability issues by simply not
relying on _end. Instead, modify the break() system call to return
the kernel's view of the current break address, and have libc
initialize its state using an extra syscall upon the first use of the
interface. As a side effect, this appears to fix brk()/sbrk() usage
in executables run with rtld direct exec, since the kernel and libc.so
no longer maintain separate views of the process' break address.

PR: 228574
Reviewed by: kib (previous version)
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D15663


# 64b378f1 30-May-2018 Brooks Davis <brooks@FreeBSD.org>

Remove alternative names that are identical to the default.

Verified by make sysent producing no changes.


# 7351a8bd 25-May-2018 Brooks Davis <brooks@FreeBSD.org>

Make vadvise compat freebsd11.

The vadvise syscall (aka ovadvise) is undocumented and has always been
implemented as returning EINVAL. Put the syscall under COMPAT11 and
provide a userspace implementation.

Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15557


# 89ea4a30 05-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Added SAL annotations to system calls.

Modify makesyscalls.sh to strip out SAL annotations.

No functional change.

This is based on work I started in CheriBSD and use to validate fat
pointers at the syscall boundary. Tal Garfinkel reviewed the changes,
added annotations to COMPAT* syscalls and is using them in a record and
playback framework. One can envision other uses such as a WITNESS-like
validator for copyin/out as speculated on in the review.

At this time we are only annotating sys/kern/syscalls.master as that is
sufficient for userspace work. If kernel use cases materialize, we can
annotate other syscalls.master as needed.

Submitted by: Tal Garfinkel <talg@cs.stanford.edu>
Sponsored by: DARPA, AFRL (in part)
Differential Revision: https://reviews.freebsd.org/D14285


# e9ac2743 20-Mar-2018 Conrad Meyer <cem@FreeBSD.org>

Implement getrandom(2) and getentropy(3)

The general idea here is to provide userspace programs with well-defined
sources of entropy, in a fashion that doesn't require opening a new file
descriptor (ulimits) or accessing paths (/dev/urandom may be restricted
by chroot or capsicum).

getrandom(2) is the more general API, and comes from the Linux world.
Since our urandom and random devices are identical, the GRND_RANDOM flag
is ignored.
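
A minimal sketch of a getrandom(2) caller follows, for illustration only.
The helper name fill_random() is hypothetical; the loop assumes that a call
may return fewer bytes than requested or be interrupted by a signal.

```c
#include <sys/types.h>
#include <sys/random.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: fill buf with len random bytes without opening a
 * file descriptor.  Returns 0 on success, or -1 with errno set.
 */
static int
fill_random(void *buf, size_t len)
{
	unsigned char *p = buf;

	while (len > 0) {
		ssize_t n = getrandom(p, len, 0);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted; retry */
			return (-1);
		}
		p += n;
		len -= (size_t)n;
	}
	return (0);
}
```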

getentropy(3) is added as a compatibility shim for the OpenBSD API.

truss(1) support is included.

Tests for both system calls are provided. Coverage is believed to be at
least as comprehensive as LTP getrandom(2) test coverage. Additionally,
instructions for running the LTP tests directly against FreeBSD are provided
in the "Test Plan" section of the Differential revision linked below. (They
pass, of course.)

PR: 194204
Reported by: David CARLIER <david.carlier AT hardenedbsd.org>
Discussed with: cperciva, delphij, jhb, markj
Relnotes: maybe
Differential Revision: https://reviews.freebsd.org/D14500


# 1c1b4c66 05-Mar-2018 Brooks Davis <brooks@FreeBSD.org>

Remove remnants of 1990s efforts to let us run Net/OpenBSD binaries.

No functional change (comments change in some generated files.)

Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14571


# 315fbaec 23-Feb-2018 Ed Maste <emaste@FreeBSD.org>

Correct pseudo misspelling in sys/ comments

contrib code and #define in intel_ata.h unchanged.


# 3f289c3f 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Clean up some header pollution created by having seq.h in proc.h.

Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403


# 5cd667e6 28-Nov-2017 Brooks Davis <brooks@FreeBSD.org>

Disable vim syntax highlighting.

Vim's default pick doesn't understand that ';' is a comment character
and the result looks horrible.

Reviewed by: emaste


# 2b34e843 16-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Add abstime kqueue(2) timers and expand struct kevent members.

This change implements NOTE_ABSTIME flag for EVFILT_TIMER, which
specifies that the data field contains absolute time to fire the
event.

To make this useful, data member of the struct kevent must be extended
to 64bit. Using the opportunity, I also added ext members. This
changes struct kevent almost to Apple struct kevent64, except I did
not change the type of ident and udata; the latter would cause serious API
incompatibilities.

The type of ident was kept uintptr_t since EVFILT_AIO returns a
pointer in this field, and e.g. CHERI is sensitive to the type
(discussed with brooks, jhb).

Unlike Apple kevent64, symbol versioning allows us to claim ABI
compatibility and still name the new syscall kevent(2). Compat shims
are provided for both host native and compat32.

Requested by: bapt
Reviewed by: bapt, brooks, ngie (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D11025


# 69921123 23-May-2017 Konstantin Belousov <kib@FreeBSD.org>

Commit the 64-bit inode project.

Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in a backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439


# 982519d1 06-Apr-2017 Brooks Davis <brooks@FreeBSD.org>

Change the size argument of __getcwd() to size_t.

This matches the getcwd() definition.

This is technically an ABI change, but that would only affect 64-bit
big-endian platforms that pass arguments on the stack. We have none of
those.

Reviewed by: jhb
Obtained from: CheriABI
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9428


# d8ca0a2b 29-Mar-2017 Robert Watson <rwatson@FreeBSD.org>

Hook up new audit event identifiers for various non-Orange Book/CAPP
system calls supported by OpenBSM 1.2-alpha5.

Obtained from: TrustedBSD Project
MFC after: 3 weeks
Sponsored by: DARPA, AFRL


# 3f8455b0 18-Mar-2017 Eric van Gyzen <vangyzen@FreeBSD.org>

Add clock_nanosleep()

Add a clock_nanosleep() syscall, as specified by POSIX.
Make nanosleep() a wrapper around it.

Attach the clock_nanosleep test from NetBSD. Adjust it for the
FreeBSD behavior of updating rmtp only when interrupted by a signal.
I believe this to be POSIX-compliant, since POSIX mentions the rmtp
parameter only in the paragraph about EINTR. This is also what
Linux does. (NetBSD updates rmtp unconditionally.)
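
The rmtp behavior described above suggests the usual retry loop, sketched
here for illustration. The helper name sleep_fully() is hypothetical. Note
that clock_nanosleep() returns the error number directly instead of setting
errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/*
 * Hypothetical helper: sleep for the full relative interval, resuming after
 * signals.  Returns 0 on success or an error number such as EINVAL.
 */
static int
sleep_fully(time_t sec, long nsec)
{
	struct timespec req = { .tv_sec = sec, .tv_nsec = nsec };
	struct timespec rem;
	int error;

	/* rem holds the unslept time only when the call returns EINTR. */
	while ((error = clock_nanosleep(CLOCK_MONOTONIC, 0, &req, &rem)) ==
	    EINTR)
		req = rem;
	return (error);
}
```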

Copy the whole nanosleep.2 man page from NetBSD because it is complete
and closely resembles the POSIX description. Edit, polish, and reword it
a bit, being sure to keep any relevant text from the FreeBSD page.

Reviewed by: kib, ngie, jilles
MFC after: 3 weeks
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D10020


# 34ed0c63 27-Dec-2016 John Baldwin <jhb@FreeBSD.org>

Rename the 'flags' argument to getfsstat() to 'mode' and validate it.

This argument is not a bitmask of flags, but only accepts a single value.
Fail with EINVAL if an invalid value is passed to 'flag'. Rename the
'flags' argument to getmntinfo(3) to 'mode' as well to match.

This is a followup to r308088.

Reviewed by: kib
MFC after: 1 month


# 82d8d2b8 07-Dec-2016 Robert Watson <rwatson@FreeBSD.org>

Replace spaces with tabs in definition of SCTP system calls, for consistency
with the remainder of the syscalls.master file. This problem does not occur
in the freebsd32 version of the same system calls.


# 5cba398b 18-Aug-2016 George V. Neville-Neil <gnn@FreeBSD.org>

Remove unused and obsolete openbsd_poll system call. (Phase 1)

Reported by: brooks
Reviewed by: brooks,jhb
Differential Revision: https://reviews.freebsd.org/D7548


# 295af703 15-Aug-2016 Konstantin Belousov <kib@FreeBSD.org>

Add an implementation of fdatasync(2).

The syscall is a trivial wrapper around new VOP_FDATASYNC(), sharing
code with fsync(2). For all filesystems, this commit provides the
implementation which delegates the work of VOP_FDATASYNC() to
VOP_FSYNC(). This is functionally correct but not efficient.

This is not yet a POSIX-compliant implementation, because it does not
ensure that queued AIO requests are completed before returning.

Reviewed by: mckusick
Discussed with: avg (ZFS), jhb (AIO part)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D7471


# 78be18ae 03-Aug-2016 Bryan Drewery <bdrewery@FreeBSD.org>

Correct some comments.

Sponsored by: EMC / Isilon Storage Division
MFC after: 3 days


# 1b42875d 03-Aug-2016 Ed Schouten <ed@FreeBSD.org>

Re-add trailing slash that was removed in r303699.

I must have accidentally pressed some random key in vim.


# a813fdc6 03-Aug-2016 Ed Schouten <ed@FreeBSD.org>

mprotect(): Change prototype to comply to POSIX.

Our mprotect() function seems to take a "const void *" address to the
pages whose permissions need to be adjusted. POSIX uses "void *". Simply
stick to the POSIX one to prevent us from writing unportable code.

PR: 211423 (exp-run)
Tested by: antoine@ (Thanks!)


# d9c4cd2f 27-Jul-2016 Ed Schouten <ed@FreeBSD.org>

Change the return type of msgrcv() to ssize_t as required by POSIX.

It looks like the msgrcv() system call is already written in such a way
that the size is internally computed as a size_t and written into all of
td_retval[0]. This means that it is effectively already returning
ssize_t. It's just that the userspace prototype doesn't match up.


# e5ec7339 10-Jul-2016 Robert Watson <rwatson@FreeBSD.org>

Do allow auditing of read(2) and write(2) system calls, by assigning
those system calls audit event identifiers AUE_READ and AUE_WRITE.
While auditing file-descriptor I/O is not required by the Common
Criteria, in practice this proves useful for both live and forensic
analysis.

NB: freebsd32 already assigns AUE_READ and AUE_WRITE to read(2) and
write(2).

MFC after: 3 days
Sponsored by: DARPA, AFRL


# e16e6409 22-Jun-2016 Brooks Davis <brooks@FreeBSD.org>

Mark the pipe() system call as COMPAT10.

As of r302092 libc uses pipe2() with a zero flags value instead of pipe().

Commit with regenerated files and implementation to follow.

Approved by: re (gjb)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D6816


# bb430bc7 21-Mar-2016 John Baldwin <jhb@FreeBSD.org>

Fully handle size_t lengths in AIO requests.

First, update the return types of aio_return() and aio_waitcomplete() to
ssize_t.

POSIX requires aio_return() to return a ssize_t so that it can represent
all return values from read() and write(). aio_waitcomplete() should use
ssize_t for the same reason.

aio_return() has used ssize_t in <aio.h> since r31620 but the manpage and
system call entry were not updated. aio_waitcomplete() has always
returned int.

Note that this does not require new system call stubs as this is
effectively only an API change in how the compiler interprets the return
value.

Second, allow aio_nbytes values up to IOSIZE_MAX instead of just INT_MAX.

aio_read/write should now honor the same length limits as normal read/write.

Third, use longs instead of ints in the aio_return() and aio_waitcomplete()
system call functions so that the 64-bit size_t in the in-kernel aiocb
isn't truncated to 32-bits before being copied out to userland or
being returned.

Finally, a simple test has been added to verify the bounds checking on the
maximum read size from a file.


# 399e8c17 09-Mar-2016 John Baldwin <jhb@FreeBSD.org>

Simplify AIO initialization now that it is standard.

- Mark AIO system calls as STD and remove the helpers to dynamically
register them.
- Use COMPAT6 for the old system calls with the older sigevent instead of
an 'o' prefix.
- Simplify the POSIX configuration to note that AIO is always available.
- Handle AIO in the default VOP_PATHCONF instead of special casing it in
the pathconf() system call. fpathconf() is still hackish.
- Remove freebsd32_aio_cancel() as it just called the native one directly.

Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5589


# 871ef8b0 11-Jul-2015 Adrian Chadd <adrian@FreeBSD.org>

Regenerate syscalls.


# 0538aafc 18-Apr-2015 Konstantin Belousov <kib@FreeBSD.org>

The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter. The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development. The
shims were reasonable to allow easier revert to the older kernel at
that time.

Now, two or three major releases later, shims do not serve any
purpose. Such old kernels cannot handle current libc, so revert the
compatibility code.

Make padded syscalls support conditional under the COMPAT6 config
option. For COMPAT32, the syscalls were under COMPAT6 already.

Remove WITHOUT_SYSCALL_COMPAT build option, whose only purpose was to
(partially) disable the removed shims.

Reviewed by: jhb, imp (previous versions)
Discussed with: peter
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 2205e0d1 23-Jan-2015 Jilles Tjoelker <jilles@FreeBSD.org>

Add futimens and utimensat system calls.

The core kernel part is patch file utimes.2008.4.diff from
pluknet@FreeBSD.org. I updated the code for API changes, added the manual
page and added compatibility code for old kernels. There is also audit and
Capsicum support.

A new UTIME_* constant might allow setting birthtimes in future.

Differential Revision: https://reviews.freebsd.org/D1426
Submitted by: pluknet (partially)
Reviewed by: delphij, pluknet, rwatson
Relnotes: yes


# 9f7a06f2 04-Jan-2015 Dmitry Chagin <dchagin@FreeBSD.org>

Indeed, instead of hiding the kern___getcwd() bug by bogus cast
in r276564, change path type to char * (pathnames are always char *).
And remove bogus casts of malloc().
kern___getcwd() internally doesn't actually use or support u_char *
paths, except to copy them to a normal char * path.

These changes are not visible to libc as libc/gen/getcwd.c misdeclares
__getcwd() as taking a plain char * path.

While here remove _SYS_SYSPROTO_H_ for __getcwd() syscall as
we always have sysproto.h.

Pointed out by: bde

MFC after: 1 week


# 186d9c34 12-Nov-2014 Dmitry Chagin <dchagin@FreeBSD.org>

Add the ppoll() system call.
Export kern_poll() needed by an upcoming Linuxulator change.

Differential Revision: https://reviews.freebsd.org/D1133
Reviewed by: kib, wblock
MFC after: 1 month


# 80b47aef 09-Oct-2014 Marcel Moolenaar <marcel@FreeBSD.org>

Move the SCTP syscalls to netinet with the rest of the SCTP code. The
syscalls themselves are tightly coupled with the network stack and
therefore should not be in the generic socket code.

The following four syscalls have been marked as NOSTD so they can be
dynamically registered in sctp_syscalls_init() function:
sys_sctp_peeloff
sys_sctp_generic_sendmsg
sys_sctp_generic_sendmsg_iov
sys_sctp_generic_recvmsg

The syscalls are also set up to be dynamically registered when COMPAT32
option is configured.

As a side effect of moving the SCTP syscalls, getsock_cap needs to be
made available outside of the uipc_syscalls.c source file. A proper
prototype has been added to the sys/socketvar.h header file.

API tests from the SCTP reference implementation have been run to ensure
compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout)

Submitted by: Steve Kiernan <stevek@juniper.net>
Reviewed by: tuexen, rrs
Obtained from: Juniper Networks, Inc.


# ce42e793 18-Mar-2014 Attilio Rao <attilio@FreeBSD.org>

Remove dead code from umtx support:
- Retire long time unused (basically always unused) sys__umtx_lock()
and sys__umtx_unlock() syscalls
- struct umtx and their supporting definitions
- UMUTEX_ERROR_CHECK flag
- Retire UMTX_OP_LOCK/UMTX_OP_UNLOCK from _umtx_op() syscall

__FreeBSD_version is not bumped yet because it is expected that further
breakages to the umtx interface will follow up in the next days.
However there will be a final bump when necessary.

Sponsored by: EMC / Isilon storage division
Reviewed by: jhb


# 55648840 19-Sep-2013 John Baldwin <jhb@FreeBSD.org>

Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
from arbitrary processes. Similar to ktrace it can apply a change to all
existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes. MADV_PROTECT still works for backwards
compatibility.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.

Reviewed by: kib, jilles (earlier version)
Approved by: re (delphij)
MFC after: 1 month


# 84c21af1 12-Sep-2013 John Baldwin <jhb@FreeBSD.org>

Fix the type of the idtype argument to wait6() in syscalls.master.

Approved by: re (kib)
MFC after: 1 week


# 7008be5b 04-Sep-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define a new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine a few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is a new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belong to the same array element this way is
correct, but is not advised. It should only be used for alias definitions.
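
The encoding and the macro-based termination described above can be sketched in
self-contained C. This is an illustration under assumptions, not the kernel
code: the shift of 57 is derived from the layout in this message (two version
bits, then five index bits, at the top of a 64-bit word), all sk_/SK_ names are
hypothetical, version bits are ignored, and the sketch returns false where the
message says the real functions assert.

```c
#include <stdarg.h>
#include <stdbool.h>

/* Hypothetical names; only the bit layout comes from the text above. */
#define SK_NELEM	2	/* array elements in this sketch */
#define SK_IDX_SHIFT	57	/* five index bits sit below the two top bits */
#define SK_CAPRIGHT(idx, bit)	((1ULL << (SK_IDX_SHIFT + (idx))) | (bit))

#define SK_CAP_LOOKUP	SK_CAPRIGHT(0, 0x0000000000000400ULL)
#define SK_CAP_PDKILL	SK_CAPRIGHT(1, 0x0000000000000800ULL)

typedef struct {
	unsigned long long r[SK_NELEM];
} sk_rights_t;

/*
 * Return the array index encoded in a right, or -1 if the index field does
 * not have exactly one bit set (e.g. rights of two elements OR-ed together).
 */
static int
sk_right_index(unsigned long long right)
{
	unsigned long long field = (right >> SK_IDX_SHIFT) & 0x1f;
	int idx;

	if (field == 0 || (field & (field - 1)) != 0)
		return (-1);
	for (idx = 0; (field & 1) == 0; idx++)
		field >>= 1;
	return (idx);
}

/* Worker that expects a 0ULL sentinel after the last right. */
static bool
sk_rights_set_impl(sk_rights_t *rights, ...)
{
	va_list ap;
	unsigned long long right;
	bool ok = true;

	va_start(ap, rights);
	while ((right = va_arg(ap, unsigned long long)) != 0) {
		int idx = sk_right_index(right);

		if (idx < 0 || idx >= SK_NELEM) {
			ok = false;	/* mixed or unknown element */
			break;
		}
		rights->r[idx] |= right;
	}
	va_end(ap);
	return (ok);
}

/* The macro appends the sentinel, so callers never terminate the list. */
#define sk_rights_set(rights, ...) \
	sk_rights_set_impl((rights), __VA_ARGS__, 0ULL)
```

Passing SK_CAP_LOOKUP and SK_CAP_PDKILL as two separate arguments stores each
in its own element, while OR-ing them into one argument is rejected because two
index bits are set.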

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


# 6160e12c 08-Jun-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Add a new system call, aio_mlock(). The name speaks for itself. It allows
the mlock(2) operation, which can consume a lot of time, to be performed under
the control of aio(4).

Reviewed by: kib, jilles
Sponsored by: Nginx, Inc.


# 48947ecc 21-May-2013 Konstantin Belousov <kib@FreeBSD.org>

Fix the wait6(2) on 32bit architectures and for the compat32, by using
the right type for the argument in syscalls.master. Also fix the
posix_fallocate(2) and posix_fadvise(2) compat32 syscalls on the
architectures which require padding of the 64bit argument.

Noted and reviewed by: jhb
Pointy hat to: kib
MFC after: 1 week


# dc570d5e 01-May-2013 Jilles Tjoelker <jilles@FreeBSD.org>

Add pipe2() system call.

The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and
O_NONBLOCK (on both sides) as part of the function.

If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p).

If the pointer is not valid, behaviour differs: pipe2() writes into the
array from the kernel like socketpair() does, while pipe() writes into the
array from an architecture-specific assembler wrapper.
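
A minimal usage sketch of pipe2() follows. The helper names are hypothetical;
pipe2(), O_CLOEXEC and O_NONBLOCK are the standard interfaces, and the
_GNU_SOURCE define is only an assumption about building with glibc.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* glibc declares pipe2() only with this defined */
#endif
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a pipe whose two ends are close-on-exec and
 * non-blocking from the moment they exist. Returns 0, or -1 with errno set.
 */
static int
make_private_pipe(int fds[2])
{
	return (pipe2(fds, O_CLOEXEC | O_NONBLOCK));
}

/* Report whether a descriptor has both FD_CLOEXEC and O_NONBLOCK set. */
static int
has_cloexec_nonblock(int fd)
{
	int fdflags = fcntl(fd, F_GETFD);
	int flflags = fcntl(fd, F_GETFL);

	if (fdflags == -1 || flflags == -1)
		return (0);
	return ((fdflags & FD_CLOEXEC) != 0 && (flflags & O_NONBLOCK) != 0);
}
```

Doing the same with pipe() followed by fcntl() would leave a window in which
another thread could fork and exec while the descriptors are still inheritable.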

Reviewed by: kan, kib


# da7d2afb 01-May-2013 Jilles Tjoelker <jilles@FreeBSD.org>

Add accept4() system call.

The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)

The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.

Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.


# e324bf91 01-Apr-2013 Matthew D Fleming <mdf@FreeBSD.org>

Fix return type of extattr_set_* and fix rmextattr(8) utility.

extattr_set_{fd,file,link} is logically a write(2)-like operation and
should return ssize_t, just like extattr_get_*. Also, the user-space
utility was using an int for the return value of extattr_get_* and
extattr_list_*, both of which return an ssize_t.

MFC after: 1 week


# e948704e 21-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Implement chflagsat(2) system call, similar to fchmodat(2), but operates on
file flags.

Reviewed by: kib, jilles
Sponsored by: The FreeBSD Foundation


# b4b2596b 21-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Make 'flags' argument to chflags(2), fchflags(2) and lchflags(2) of type
u_long. Before this change it was of type int for syscalls, but prototypes
in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
for lchflags(2)) stated that it was u_long. Now some related functions
use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.

Discussed on: arch
Sponsored by: The FreeBSD Foundation


# 7493f24e 02-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Implement two new system calls:

int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);

which allow binding and connecting, respectively, to a UNIX domain socket with a
path relative to the directory associated with the given file descriptor 'fd'.

- Add manual pages for the new syscalls.

- Make the new syscalls available for processes in capability mode sandbox.

- Add capability rights CAP_BINDAT and CAP_CONNECTAT that have to be present on
the directory descriptor for the syscalls to work.

- Update audit(4) to support those two new syscalls and to handle path
in sockaddr_un structure relative to the given directory descriptor.

- Update procstat(1) to recognize the new capability rights.

- Document the new capability rights in cap_rights_limit(2).

Sponsored by: The FreeBSD Foundation
Discussed with: rwatson, jilles, kib, des


# 2609222a 01-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Merge Capsicum overhaul:

- Capability is no longer a separate descriptor type. Now every descriptor
has its own set of capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If the CAP_IOCTL capability right is present we can further reduce the list
of allowed ioctls with the new cap_ioctls_limit(2) syscall. The list of allowed
ioctls can be retrieved with the cap_ioctls_get(2) syscall.

- If the CAP_FCNTL capability right is present we can further reduce the fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrieve
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavily modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and even though I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNOD to CAP_MKNODAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convenient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
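
The revised CAP_READ and CAP_PREAD semantics listed above can be modelled with
a plain 64-bit mask, which is how rights were represented at the time of this
commit. This is only a sketch: the bit values and all sk_/SK_ names are
hypothetical and chosen for illustration, not taken from the header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SK_CAP_READ	0x1ULL
#define SK_CAP_WRITE	0x2ULL
#define SK_CAP_SEEK	0x4ULL
#define SK_CAP_PREAD	(SK_CAP_SEEK | SK_CAP_READ)
#define SK_CAP_PWRITE	(SK_CAP_SEEK | SK_CAP_WRITE)

/* True when every right in `needed` is also present in `held`. */
static bool
sk_rights_contain(uint64_t held, uint64_t needed)
{
	return ((held & needed) == needed);
}
```

Under this model a descriptor limited to SK_CAP_READ passes the check for
read(2) but fails it for pread(2), which also needs the seek right.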

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


# f13b5a0f 12-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.
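
wait6(2) itself is FreeBSD-specific, so the sketch below uses the POSIX
waitid() interface that it is modelled on. It shows the explicit WEXITED flag
and the siginfo_t result mentioned above; the helper name is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that exits with `code`, then collect it
 * with waitid(). WEXITED must be requested explicitly, and the exit status
 * comes back in a siginfo_t. Returns the exit code, or -1 on failure.
 */
static int
run_and_reap(int code)
{
	siginfo_t info;
	pid_t pid = fork();

	if (pid == -1)
		return (-1);
	if (pid == 0)
		_exit(code);		/* child */
	if (waitid(P_PID, (id_t)pid, &info, WEXITED) == -1)
		return (-1);
	if (info.si_code != CLD_EXITED)
		return (-1);		/* killed or dumped core instead */
	return (info.si_status);
}
```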

PR: standards/170346
Submitted by: "Jukka A. Ukkonen" <jau@iki.fi>
MFC after: 1 month


# d65f1abc 16-Aug-2012 David Xu <davidxu@FreeBSD.org>

Implement syscall clock_getcpuclockid2, so we can get a clock id
for process, thread or others we want to support.
Use the syscall to implement POSIX API clock_getcpuclockid and
pthread_getcpuclockid.
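
A short sketch of how the POSIX clock_getcpuclockid() interface is typically
used from userland. The helper name is hypothetical, and the code exercises
only the POSIX function, not the new syscall directly.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <sys/types.h>
#include <time.h>

/*
 * Hypothetical helper: CPU time consumed so far by the calling process, in
 * nanoseconds, or -1 on failure.
 */
static long long
process_cpu_time_ns(void)
{
	clockid_t cid;
	struct timespec ts;

	/* Returns an error number rather than -1; pid 0 means the caller. */
	if (clock_getcpuclockid(0, &cid) != 0)
		return (-1);
	if (clock_gettime(cid, &ts) == -1)
		return (-1);
	return ((long long)ts.tv_sec * 1000000000LL + ts.tv_nsec);
}
```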

PR: 168417


# 520b6a84 25-May-2012 Ed Schouten <ed@FreeBSD.org>

Remove use of non-ISO-C integer types from system call tables.

These files already use ISO-C-style integer types, so make them less
inconsistent by preferring the standard types.


# cf13a585 20-Nov-2011 Lawrence Stewart <lstewart@FreeBSD.org>

- Add the ffclock_getcounter(), ffclock_getestimate() and ffclock_setestimate()
system calls to provide feed-forward clock management capabilities to
userspace processes. ffclock_getcounter() returns the current value of the
kernel's feed-forward clock counter. ffclock_getestimate() returns the current
feed-forward clock parameter estimates and ffclock_setestimate() updates the
feed-forward clock parameter estimates.

- Document the syscalls in the ffclock.2 man page.

- Regenerate the script-derived syscall related files.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

Submitted by: Julien Ridoux (jridoux at unimelb edu au)


# d3a993d4 18-Nov-2011 Ed Schouten <ed@FreeBSD.org>

Improve *access*() parameter name consistency.

The current code mixes the use of `flags' and `mode'. This is a bit
confusing, since the faccessat() function has a `flag' parameter to store
the AT_ flag.

Make this less confusing by using the same name as used in the POSIX
specification -- `amode'.


# 936c09ac 03-Nov-2011 John Baldwin <jhb@FreeBSD.org>

Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provides hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.
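
A minimal sketch of the first kind of hint, an access-pattern advice on a whole
file. The helper name is hypothetical; posix_fadvise() and
POSIX_FADV_SEQUENTIAL are the POSIX interfaces.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>

/*
 * Hypothetical helper: tell the kernel the file will be read sequentially.
 * Returns 0 on success or an error number (errno is not set).
 */
static int
advise_sequential(int fd)
{
	/* Offset 0 with length 0 covers the file from start to end. */
	return (posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL));
}
```

Because the call is purely advisory, a caller can usually ignore a failure and
carry on reading.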

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month


# cfb5f768 18-Aug-2011 Jonathan Anderson <jonathan@FreeBSD.org>

Add experimental support for process descriptors

A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.

New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.

When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.

This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.

Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc


# cfb9df55 15-Jul-2011 Jonathan Anderson <jonathan@FreeBSD.org>

Add cap_new() and cap_getrights() system calls.

Implement two previously-reserved Capsicum system calls:
- cap_new() creates a capability to wrap an existing file descriptor
- cap_getrights() queries the rights mask of a capability.

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc


# d91f88f7 18-Apr-2011 Matthew D Fleming <mdf@FreeBSD.org>

Add the posix_fallocate(2) syscall. The default implementation in
vop_stdallocate() is filesystem agnostic and will run as slow as a
read/write loop in userspace; however, it serves to correctly
implement the functionality for filesystems that do not implement a
VOP_ALLOCATE.
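
A small usage sketch of posix_fallocate() from userland. The helper name is
hypothetical, and the sketch assumes a filesystem on which the call succeeds.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: make sure the first `len` bytes of the file are backed
 * by storage. Returns the file size afterwards, or -1 on failure.
 */
static off_t
reserve_bytes(int fd, off_t len)
{
	struct stat sb;

	/* posix_fallocate() returns an error number and does not set errno. */
	if (posix_fallocate(fd, 0, len) != 0)
		return (-1);
	if (fstat(fd, &sb) == -1)
		return (-1);
	return (sb.st_size);
}
```

On an empty file the size grows to `len`, since POSIX requires the file to be
extended when the requested range ends beyond the current end of file.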

Note that __FreeBSD_version was already bumped today to 900036 for any
ports which would like to use this function.

Also reserve space in the syscall table for posix_fadvise(2).

Reviewed by: -arch (previous version)


# ec125fbb 30-Mar-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add rctl. It's used by racct to take user-configurable actions based
on the set of rules it maintains and the current resource usage. It also
provides a userland API to manage that ruleset.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 2bfc50bc 04-Mar-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add two new system calls, setloginclass(2) and getloginclass(2). This makes
it possible for the kernel to track the login class the process is assigned to,
which is required for RCTL. This change also makes setusercontext(3) call
setloginclass(2) and makes it possible to retrieve the current login class using
id(1).

Reviewed by: kib (as part of a larger patch)


# 96fcc75f 01-Mar-2011 Robert Watson <rwatson@FreeBSD.org>

Add initial support for Capsicum's Capability Mode to the FreeBSD kernel,
compiled conditionally on options CAPABILITIES:

Add a new credential flag, CRED_FLAG_CAPMODE, which indicates that a
subject (typically a process) is in capability mode.

Add two new system calls, cap_enter(2) and cap_getmode(2), which allow
setting and querying (but never clearing) the flag.

Export the capability mode flag via process information sysctls.

Sponsored by: Google, Inc.
Reviewed by: anderson
Discussed with: benl, kris, pjd
Obtained from: Capsicum Project
MFC after: 3 months


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 8d19559b 30-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

Make the syscalls reserved for AFS usable by OpenAFS port.

Submitted by: Benjamin Kaduk <kaduk mit edu>
MFC after: 2 weeks


# 13561ed4 26-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

Fix typo.

Submitted by: Ben Kaduk <minimarmot gmail com>


# 153ac44c 28-Jun-2010 Konstantin Belousov <kib@FreeBSD.org>

Count number of threads that enter and leave dynamically registered
syscalls. On the dynamic syscall deregistration, wait until all
threads leave the syscall code. This somewhat increases the safety
of the loadable modules unloading.

Reviewed by: jhb
Tested by: pho
MFC after: 1 month


# 790f66db 08-Feb-2010 Ed Schouten <ed@FreeBSD.org>

Remove unused LIBCOMPAT keyword from syscalls.master.


# 7e767511 19-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r198508, r198509:
Reimplement pselect() in kernel, making change of sigmask and sleep atomic.

MFC r198538:
Move pselect(3) man page to section 2.


# 304e9b14 13-Dec-2009 Robert Watson <rwatson@FreeBSD.org>

Merge r197636 from head to stable/8:

Reserve system call numbers for Capsicum security framework capabilities,
capability mode, and process descriptors: cap_new, cap_getrights, cap_enter,
cap_getmode, pdfork, pdkill, pdgetpid, and pdwait.

Obtained from: TrustedBSD Project
Sponsored by: Google


# 066d836b 27-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

The current pselect(3) is implemented in usermode and is thus vulnerable to a
well-known race condition, whose elimination was the reason for the
function's introduction in the first place. If the sigmask supplied as an
argument to pselect() enables a signal, the signal might be delivered before
the thread calls select(2), causing a lost wakeup. Reimplement pselect() in
the kernel, making the change of sigmask and the sleep atomic.

Since the signal shall be delivered to usermode, but the sigmask restored,
set TDP_OLDMASK and save the old mask in td_oldsigmask. TDP_OLDMASK
should be cleared by ast() in case the signal was not delivered during
syscall execution.
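
The atomic unblock-and-wait behaviour can be observed from userland with a
sketch like the one below. The function and handler names are hypothetical, and
the two-second timeout is only a safety net in case no signal arrives.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>
#include <sys/select.h>
#include <time.h>

static volatile sig_atomic_t got_usr1;

static void
on_usr1(int signo)
{
	(void)signo;
	got_usr1 = 1;
}

/*
 * Keep SIGUSR1 blocked, make one pending, then let pselect() unblock it only
 * for the duration of the wait. Returns 1 if the handler ran inside pselect()
 * and the call reported EINTR, 0 otherwise.
 */
static int
pselect_sees_pending_signal(void)
{
	struct sigaction sa;
	sigset_t block, during_wait, saved;
	struct timespec timeout = { 2, 0 };
	int rc, err;

	sa.sa_handler = on_usr1;
	sa.sa_flags = 0;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, NULL) == -1)
		return (0);

	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	if (sigprocmask(SIG_BLOCK, &block, &saved) == -1)
		return (0);

	got_usr1 = 0;
	raise(SIGUSR1);			/* stays pending while blocked */
	if (got_usr1 != 0) {		/* must not have been delivered yet */
		sigprocmask(SIG_SETMASK, &saved, NULL);
		return (0);
	}

	during_wait = saved;
	sigdelset(&during_wait, SIGUSR1);
	rc = pselect(0, NULL, NULL, NULL, &timeout, &during_wait);
	err = errno;

	sigprocmask(SIG_SETMASK, &saved, NULL);
	return (rc == -1 && err == EINTR && got_usr1 == 1);
}
```

With a userland emulation that calls sigprocmask() and then select(), the
signal could be delivered between the two calls and the wakeup would be lost.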

Reviewed by: davidxu
Tested by: pho
MFC after: 1 month


# 72c35fc6 30-Sep-2009 Robert Watson <rwatson@FreeBSD.org>

Reserve system call numbers for Capsicum security framework capabilities,
capability mode, and process descriptors: cap_new, cap_getrights, cap_enter,
cap_getmode, pdfork, pdkill, pdgetpid, and pdwait.

Obtained from: TrustedBSD Project
Sponsored by: Google
MFC after: 3 weeks


# c3889811 08-Jul-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

There is an optimization in chmod(1) that makes it skip the chmod(2) call
if the new file mode is the same as it was before; however, this
optimization must be disabled for filesystems that support NFSv4 ACLs.
Chmod uses pathconf(2) to determine whether this is the case - however,
pathconf(2) always follows symbolic links, while the 'chmod -h' doesn't.

This change adds lpathconf(3) to make it possible to solve that problem
in a clean way.

Reviewed by: rwatson (earlier version)
Approved by: re (kib)


# b648d480 24-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Change the ABI of some of the structures used by the SYSV IPC API:
- The uid/cuid members of struct ipc_perm are now uid_t instead of unsigned
short.
- The gid/cgid members of struct ipc_perm are now gid_t instead of unsigned
short.
- The mode member of struct ipc_perm is now mode_t instead of unsigned short
(this is merely a style bug).
- The rather dubious padding fields for ABI compat with SV/I386 have been
removed from struct msqid_ds and struct semid_ds.
- The shm_segsz member of struct shmid_ds is now a size_t instead of an
int. This removes the need for the shm_bsegsz member in struct
shmid_kernel and should allow for complete support of SYSV SHM regions
>= 2GB.
- The shm_nattch member of struct shmid_ds is now an int instead of a
short.
- The shm_internal member of struct shmid_ds is now gone. The internal
VM object pointer for SHM regions has been moved into struct
shmid_kernel.
- The existing __semctl(), msgctl(), and shmctl() system call entries are
now marked COMPAT7 and new versions of those system calls which support
the new ABI are now present.
- The new system calls are assigned to the FBSD-1.1 version in libc. The
FBSD-1.0 symbols in libc now refer to the old COMPAT7 system calls.
- A simplistic framework for tagging system calls with compatibility
symbol versions has been added to libc. Version tags are added to
system calls by adding an appropriate __sym_compat() entry to
src/lib/libc/include/compat.h. [1]

PR: kern/16195 kern/113218 bin/129855
Reviewed by: arch@, rwatson
Discussed with: kan, kib [1]


# 45f48220 24-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Deprecate the msgsys(), semsys(), and shmsys() system calls by moving
them under COMPAT_FREEBSD[4567]. Starting with FreeBSD 5.0 the SYSV IPC
API was implemented via direct system calls (e.g. msgctl(), msgget(), etc.)
rather than indirecting through the var-args *sys() system calls. The
shmsys() system call was already effectively deprecated for all but
COMPAT_FREEBSD4, as its implementation for the !COMPAT_FREEBSD4 case
was to simply invoke nosys().


# 3c366f1f 24-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Add a new COMPAT7 flag for FreeBSD 7.x compatibility system calls.


# 0b0fe06a 22-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Fix a typo in a comment.


# 21def99b 17-Jun-2009 John Baldwin <jhb@FreeBSD.org>

- Add the ability to mix multiple flags separated by pipe ('|') characters
in the type field of system call tables. Specifically, one can now use
the 'NO*' types as flags in addition to the 'COMPAT*' types. For example,
to tag 'COMPAT*' system calls as living in a KLD via NOSTD. The COMPAT*
type is required to be listed first in this case.
- Add new functions 'type()' and 'flag()' to the embedded awk script in
makesyscalls.sh that return true if a requested flag is found in the
type field ($3). The flag() function checks all of the flags in the
field, but type() only checks the first flag. type() is meant to be
used in the top-level "switch" statement and flag() should be used
otherwise.
- Retire the CPT_NOA type, it is now replaced with "COMPAT|NOARGS" using
the flags approach.
- Tweak the comment descriptions of COMPAT[46] system calls so that they
say "freebsd[46] foo" rather than "old foo".
- Document the COMPAT6 type.
- Sync comments in compat32 syscall table with the master table.
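
The difference between type() and flag() described above can be re-expressed
in C. This is only an illustrative sketch: the real functions are awk code
embedded in makesyscalls.sh, and the sk_ names and exact matching rules here
are assumptions.

```c
#include <stdbool.h>
#include <string.h>

/* type(): true only if `name` is the first '|'-separated token of `field`. */
static bool
sk_type(const char *field, const char *name)
{
	size_t len = strcspn(field, "|");

	return (strlen(name) == len && strncmp(field, name, len) == 0);
}

/* flag(): true if `name` matches any '|'-separated token of `field`. */
static bool
sk_flag(const char *field, const char *name)
{
	size_t nlen = strlen(name);

	for (;;) {
		size_t len = strcspn(field, "|");

		if (len == nlen && strncmp(field, name, len) == 0)
			return (true);
		if (field[len] == '\0')
			return (false);
		field += len + 1;	/* skip past the '|' */
	}
}
```

For a type field of "COMPAT|NOARGS", sk_type() matches only "COMPAT", while
sk_flag() matches both "COMPAT" and "NOARGS".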


# 0ec0b41c 17-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Remove the now-unused NOIMPL flag. It serves no useful purpose given the
existing UNIMPL and NOSTD types.


# f4258391 17-Jun-2009 John Baldwin <jhb@FreeBSD.org>

- NOSTD results in lkmressys being used instead of lkmssys.
- Mark nfsclnt as UNIMPL. It should have been NOSTD instead of NOIMPL back
when it lived in nfsclient.ko, but it was removed from that a long time
ago.


# c4f16b69 15-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Add a new 'void closefrom(int lowfd)' system call. When called, it closes
any open file descriptors >= 'lowfd'. It is largely identical to the same
function on other operating systems such as Solaris, DFly, NetBSD, and
OpenBSD. One difference from other *BSD is that this closefrom() does not
fail with any errors. In practice, while the manpages for NetBSD and
OpenBSD claim that they return EINTR, they ignore internal errors from
close() and never return EINTR. DFly does return EINTR, but for the common
use case (closing fd's prior to execve()), the caller really wants all
fd's closed and returning EINTR just forces callers to call closefrom() in
a loop until it stops failing.

Note that this implementation of closefrom(2) does not make any effort to
resolve userland races with open(2) in other threads. As such, it is not
multithread safe.

Submitted by: rwatson (initial version)
Reviewed by: rwatson
MFC after: 2 weeks


# b38ff370 29-Apr-2009 Jamie Gritton <jamie@FreeBSD.org>

Introduce the extensible jail framework, using the same "name=value"
interface as nmount(2). Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
This replaces jail(2).
* jail_get, to read the parameters of existing jails. This replaces the
security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes. The current jail(2) system
call still exists, though it is now a stub to jail_set(2).

Approved by: bz (mentor)


# a1b5a895 09-Nov-2008 Ed Schouten <ed@FreeBSD.org>

Mark uname(), getdomainname() and setdomainname() with COMPAT_FREEBSD4.

Looking at our source code history, it seems the uname(),
getdomainname() and setdomainname() system calls got deprecated
somewhere after FreeBSD 1.1, but they have never been phased out
properly. Because we don't have a COMPAT_FREEBSD1, just use
COMPAT_FREEBSD4.

Also fix the Linuxolator to build without the setdomainname() routine by
just making it call userland_sysctl on kern.domainname. Also replace the
setdomainname()'s implementation to use this approach, because we're
duplicating code with sysctl_domainname().

I wasn't able to keep these three routines working in our
COMPAT_FREEBSD32, because that would require yet another keyword for
syscalls.master (COMPAT4+NOPROTO). Because this routine is probably
unused already, this won't be a problem in practice. If it turns out to
be a problem, we'll just restore this functionality.

Reviewed by: rdivacky, kib


# a9148abd 03-Nov-2008 Doug Rabson <dfr@FreeBSD.org>

Implement support for RPCSEC_GSS authentication to both the NFS client
and server. This replaces the RPC implementation of the NFS client and
server with the newer RPC implementation originally developed
(actually ported from the userland sunrpc code) to support the NFS
Lock Manager. I have tested this code extensively and I believe it is
stable and that performance is at least equal to the legacy RPC
implementation.

The NFS code currently contains support for both the new RPC
implementation and the older legacy implementation inherited from the
original NFS codebase. The default is to use the new implementation -
add the NFS_LEGACYRPC option to fall back to the old code. When I
merge this support back to RELENG_7, I will probably change this so
that users have to 'opt in' to get the new code.

To use RPCSEC_GSS on either client or server, you must build a kernel
which includes the KGSSAPI option and the crypto device. On the
userland side, you must build at least a new libc, mountd, mount_nfs
and gssd. You must install new versions of /etc/rc.d/gssd and
/etc/rc.d/nfsd and add 'gssd_enable=YES' to /etc/rc.conf.

As long as gssd is running, you should be able to mount an NFS
filesystem from a server that requires RPCSEC_GSS authentication. The
mount itself can happen without any kerberos credentials but all
access to the filesystem will be denied unless the accessing user has
a valid ticket file in the standard place (/tmp/krb5cc_<uid>). There
is currently no support for situations where the ticket file is in a
different place, such as when the user logged in via SSH and has
delegated credentials from that login. This restriction is also
present in Solaris and Linux. In theory, we could improve this in
future, possibly using Brooks Davis' implementation of variant
symlinks.

Supporting RPCSEC_GSS on a server is nearly as simple. You must create
service creds for the server in the form 'nfs/<fqdn>@<REALM>' and
install them in /etc/krb5.keytab. The standard heimdal utility ktutil
makes this fairly easy. After the service creds have been created, you
can add a '-sec=krb5' option to /etc/exports and restart both mountd
and nfsd.

The only other difference an administrator should notice is that nfsd
doesn't fork to create service threads any more. In normal operation,
there will be two nfsd processes, one in userland waiting for TCP
connections and one in the kernel handling requests. The latter
process will create as many kthreads as required - these should be
visible via 'top -H'. The code has some support for varying the number
of service threads according to load but initially at least, nfsd uses
a fixed number of threads according to the value supplied to its '-n'
option.

Sponsored by: Isilon Systems
MFC after: 1 month


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 48a43ae8 25-Sep-2008 John Baldwin <jhb@FreeBSD.org>

Tidy up a few things with syscall generation:
- Instead of using a syscall slot (370) just to get a function prototype
for lkmressys(), add an explicit function prototype to <sys/sysent.h>.
This also removes unused special case checks for 'lkmressys' from
makesyscalls.sh.
- Instead of having magic logic in makesyscalls.sh to only generate a
function prototype the first time 'lkmnosys' is seen, make 'NODEF'
always not generate a function prototype and include an explicit
prototype for 'lkmnosys' in <sys/sysent.h>.
- As a result of the fix in (2), update the LKM syscall entries in
the freebsd32 syscall table to use 'lkmnosys' rather than 'nosys'.
- Use NOPROTO for the __syscall() entry (198) in the native ABI. This
avoids the need for magic logic in makesyscalls.sh to only generate
a function prototype the first time 'nosys' is encountered.


# e484af13 24-Aug-2008 Robert Watson <rwatson@FreeBSD.org>

When MPSAFE ttys were merged, a new BSM audit event identifier was
allocated for posix_openpt(2). Unfortunately, that identifier
conflicts with other events already allocated to other systems in
OpenBSM. Assign a new globally unique identifier and conform
better to the AUE_ event naming scheme.

This is a stopgap until a new OpenBSM import is done with the
correct identifier, so we'll maintain this as a local diff in svn
until then.

Discussed with: ed
Obtained from: TrustedBSD Project


# 35c316ca 21-Aug-2008 David E. O'Brien <obrien@FreeBSD.org>

Add comments on NOARGS, NODEF, and NOPROTO.


# bc093719 20-Aug-2008 Ed Schouten <ed@FreeBSD.org>

Integrate the new MPSAFE TTY layer to the FreeBSD operating system.

The last half year I've been working on a replacement TTY layer for the
FreeBSD kernel. The new TTY layer was designed to improve the following:

- Improved driver model:

The old TTY layer has a driver model that is not abstract enough to
make it friendly to use. A good example is the output path, where the
device drivers directly access the output buffers. This means that an
in-kernel PPP implementation must always convert network buffers into
TTY buffers.

If a PPP implementation would be built on top of the new TTY layer
(still needs a hooks layer, though), it would allow the PPP
implementation to directly hand the data to the TTY driver.

- Improved hotplugging:

With the old TTY layer, it isn't entirely safe to destroy TTYs from
the system. This implementation has a two-step destruction design,
where the driver first abandons the TTY. After all threads have left
the TTY, the TTY layer calls a routine in the driver, which can be
used to free resources (unit numbers, etc).

The pts(4) driver also implements this feature, which means
posix_openpt() will now return PTYs that are created on the fly.

- Improved performance:

One of the major improvements is the per-TTY mutex, which is expected
to improve scalability when compared to the old Giant locking.
Another change is the unbuffered copying to userspace, which is both
used on TTY device nodes and PTY masters.

Upgrading should be quite straightforward. Unlike previous versions,
existing kernel configuration files do not need to be changed, except
when they reference device drivers that are listed in UPDATING.

Obtained from: //depot/projects/mpsafetty/...
Approved by: philip (ex-mentor)
Discussed: on the lists, at BSDCan, at the DevSummit
Sponsored by: Snow B.V., the Netherlands
dcons(4) fixed by: kan


# 8b07e49a 09-May-2008 Julian Elischer <julian@FreeBSD.org>

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4.
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different packet streams to be routed by more than just the destination
address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x), but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as an EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in the timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.

Implementation method, Compatible version. (part 1)
-------------------------------

For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extend that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X], where for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
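
The indexing change described above can be sketched in a few lines of C.
This is only an illustration, not the kernel code: NUM_FIBS, NUM_AF,
AF_IPV4_IDX, struct fib_head and fib_head_for() are hypothetical names, and
only the rt_tables[Y][X] shape is taken from the text.

```c
#include <stddef.h>

#define NUM_FIBS	16	/* hypothetical: number of FIBs compiled in */
#define NUM_AF		4	/* hypothetical: number of protocol families */
#define AF_IPV4_IDX	2	/* hypothetical index of the IPv4 family */

struct fib_head {
	void	*routes;	/* stand-in for the per-family route trie */
};

/* Row Y selects the FIB, column X selects the protocol family. */
static struct fib_head rt_tables[NUM_FIBS][NUM_AF];

/*
 * Return the head for (fib, family).  Every family except IPv4 is
 * forced to row 0, so FIB-unaware code keeps seeing the first row.
 * Treating an out-of-range fib as row 0 is a choice made for this
 * sketch only.
 */
static struct fib_head *
fib_head_for(int fib, int family)
{
	if (family < 0 || family >= NUM_AF)
		return (NULL);
	if (family != AF_IPV4_IDX || fib < 0 || fib >= NUM_FIBS)
		fib = 0;
	return (&rt_tables[fib][family]);
}
```

In this sketch a lookup for a non-IPv4 family returns the same head no
matter which FIB number is passed in.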

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. If nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility called setfib
that acts a bit like nice.

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being responded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect with the process that set up the tunnel.
Thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)
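
The precedence among classes 1 to 3 above can be sketched as a small
selection function. This is a simplified model under stated assumptions, not
the committed code: struct pkt_ctx, its fields and select_fib() are
hypothetical names invented for the illustration.

```c
#include <stdbool.h>

struct pkt_ctx {
	bool	local;		/* class 1: generated by a socket/PCB */
	int	so_fib;		/* FIB number recorded on that socket */
	bool	classified;	/* class 3: a packet classifier tagged it */
	int	cls_fib;	/* FIB number chosen by the classifier */
};

/* Pick the FIB for a packet, most specific source first. */
static int
select_fib(const struct pkt_ctx *p)
{
	if (p->classified)
		return (p->cls_fib);	/* classifier overrides defaults */
	if (p->local)
		return (p->so_fib);	/* inherited from the socket */
	return (0);			/* forwarded packets default to FIB 0 */
}
```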

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routed
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from any to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. It will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. This will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing to the data.
Instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods thus reached would be called. The methods would have
an argument that gives the FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibility of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the destination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# 7104518b 30-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Add the openat(), fexecve() and other *at() syscalls to the table.

Based on the submission by rdivacky,
sponsored by Google Summer of Code 2007
Reviewed by: rwatson, rdivacky
Tested by: pho


# dfdcada3 26-Mar-2008 Doug Rabson <dfr@FreeBSD.org>

Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
client handle safely with replies being de-multiplexed at the socket
upcall (typically driven directly by the NIC interrupt) and handed
off to whichever thread matches the reply. For UDP sockets, many RPC
clients can share the same socket. This allows the use of a single
privileged UDP port number to talk to an arbitrary number of remote
hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
server would be relatively straightforward and would follow
approximately the Solaris KPI. A single thread should be sufficient
for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
callbacks. I've tested the NLM server reasonably extensively - it
passes both my own tests and the NFS Connectathon locking tests
running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
support for the local NFS client's locking needs, it does have to
field async replies and granted callbacks from remote NLMs that the
local client has contacted. We relay these replies to the userland
rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
it will detect deadlocks caused by a lock request that covers more
than one blocking request. As required by the NLM protocol, all
deadlock detection happens synchronously - a user is guaranteed that
if a lock request isn't rejected immediately, the lock will
eventually be granted. The old system allowed for a 'deferred
deadlock' condition where a blocked lock request could wake up and
find that some other deadlock-causing lock owner had beaten them to
the lock.

* Since both local and remote locks are managed by the same kernel
locking code, local and remote processes can safely use file locks
for mutual exclusion. Local processes have no fairness advantage
compared to remote processes when contending to lock a region that
has just been unlocked - the local lock manager enforces a strict
first-come first-served model for both local and remote lockers.

Sponsored by: Isilon Systems
PR: 95247 107555 115524 116679
MFC after: 2 weeks


# 7f64829a 25-Mar-2008 Ruslan Ermilov <ru@FreeBSD.org>

Fixed type of the fourth argument of cpuset_{get,set}affinity(2) to be size_t.

Prodded by: davidxu


# 6617724c 12-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# d7f687fc 02-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

Add cpuset, an api for thread to cpu binding and cpu resource grouping
and assignment.
- Add a reference to a struct cpuset in each thread that is inherited from
the thread that created it.
- Release the reference when the thread is destroyed.
- Add prototypes for syscalls and macros for manipulating cpusets in
sys/cpuset.h
- Add syscalls to create, get, and set new numbered cpusets:
cpuset(), cpuset_{get,set}id()
- Add syscalls for getting and setting affinity masks for cpusets or
individual threads: cpuset_{get,set}affinity()
- Add types for the 'level' and 'which' parameters for the cpuset. This
will permit expansion of the api to cover cpu masks for other objects
identifiable with an id_t integer. For example, IRQs and Jails may be
coming soon.
- The root set 0 contains all valid cpus. All threads initially belong to
cpuset 1. This permits migrating all threads off of certain cpus to
reserve them for special applications.

Sponsored by: Nokia
Discussed with: arch, rwatson, brooks, davidxu, deischen
Reviewed by: antoine


# 5f56182b 12-Feb-2008 Ruslan Ermilov <ru@FreeBSD.org>

Change readlink(2)'s return type and type of the last argument
to match POSIX.

Prodded by: Alexey Lyashkov


# 6c902059 20-Jan-2008 Robert Watson <rwatson@FreeBSD.org>

Use audit events AUE_SHMOPEN and AUE_SHMUNLINK with new system calls
shm_open() and shm_unlink(). More auditing will need to be done for
these calls to capture arguments properly.


# 8e38aeff 08-Jan-2008 John Baldwin <jhb@FreeBSD.org>

Add a new file descriptor type for IPC shared memory objects and use it to
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
object which provides the backing store. Each descriptor starts off with
a size of zero, but the size can be altered via ftruncate(2). The shared
memory file descriptors also support fstat(2). read(2), write(2),
ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
manage shared memory file descriptors. The virtual namespace that maps
pathnames to shared memory file descriptors is implemented as a hash
table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
path argument to shm_open(2). In this case, an unnamed shared memory
file descriptor will be created similar to the IPC_PRIVATE key for
shmget(2). Note that the shared memory object can still be shared among
processes by sharing the file descriptor via fork(2) or sendmsg(2), but
it is unnamed. This effectively serves to implement the getmemfd() idea
bandied about the lists several times over the years.
- The backing store for shared memory file descriptors is garbage
collected when they are not referenced by any open file descriptors or
the shm_open(2) virtual namespace.
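
The pathname hashing mentioned above can be sketched as follows. The text
only says "32-bit Fowler/Noll/Vo hash"; the FNV-1 variant, the standard
offset basis, and the names fnv1_32_str() and shm_bucket() are assumptions
made for this illustration and may not match the kernel's own helper or
constants.

```c
#include <stddef.h>
#include <stdint.h>

#define FNV1_32_OFFSET	2166136261u	/* standard FNV-1 offset basis (assumed) */
#define FNV_32_PRIME	16777619u	/* standard 32-bit FNV prime */

/* 32-bit FNV-1: multiply by the prime, then XOR in the next byte. */
static uint32_t
fnv1_32_str(const char *s)
{
	uint32_t h = FNV1_32_OFFSET;

	while (*s != '\0') {
		h *= FNV_32_PRIME;
		h ^= (uint8_t)*s++;
	}
	return (h);
}

/* Map a path to a hash bucket; nbuckets must be a power of two. */
static size_t
shm_bucket(const char *path, size_t nbuckets)
{
	return (fnv1_32_str(path) & (nbuckets - 1));
}
```

A lookup would then walk only the entries chained on that bucket and compare
full pathnames, since different paths can share a hash value.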

Submitted by: dillon, peter (previous versions)
Submitted by: rwatson (I based this on his version)
Reviewed by: alc (suggested converting getmemfd() to shm_open())


# 7188e3c8 19-Oct-2007 Ed Maste <emaste@FreeBSD.org>

Put comments about syscalls by the correct ones, and use the correct syscall
number in the comment.


# 0b1f0611 15-Aug-2007 David Xu <davidxu@FreeBSD.org>

Add thr_kill2 syscall which sends a signal to a thread in another process.

Submitted by: Tijl Coosemans tijl at ulyssis dot org
Approved by: re (kensmith)


# 51504d9a 04-Jul-2007 Peter Wemm <peter@FreeBSD.org>

Create new syscalls for mmap(), lseek(), pread(), pwrite(), truncate() and
ftruncate(), but without the pad arg.

There are several reasons for this. Consider 'mmap()'. On AMD64, the
function call (and syscall) ABI allow for 6 register arguments. Additional
arguments go on the stack. mmap(2) has 6 arguments. However, the syscall
definition has an extra 'int pad' argument. This pushes it to 7 arguments,
which means one must spill into the memory stack. Since the kernel API
doesn't match userland API, we have a hack in libc - libc/sys/mmap.c.
This implements the userland API by calling __syscall() with an extra
argument and the pad argument, for a total of 8 args. This is all
unnecessary and inconvenient for several things, including the kernel's
syscall handler code which now has to handle merging stack arguments with
register arguments. It is a big deal for certain 3rd party code.
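
The argument-count arithmetic above can be made concrete with a short
sketch. The two structures and stack_args() are hypothetical and exist only
to count arguments; the figure of 6 register arguments on AMD64 is taken
from the text.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical layout of the old entry: 7 arguments. */
struct mmap_args_padded {
	void	*addr;
	size_t	len;
	int	prot;
	int	flags;
	int	fd;
	int	pad;		/* the extra argument being dropped */
	off_t	pos;
};

/* Hypothetical layout of the new entry: 6 arguments, like mmap(2). */
struct mmap_args_plain {
	void	*addr;
	size_t	len;
	int	prot;
	int	flags;
	int	fd;
	off_t	pos;
};

#define AMD64_REG_ARGS	6	/* register arguments, per the text */

/* Number of arguments that no longer fit in registers. */
static int
stack_args(int nargs)
{
	return (nargs > AMD64_REG_ARGS ? nargs - AMD64_REG_ARGS : 0);
}
```

With the pad argument present one argument spills to the stack; without it,
none do.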

I'm adding libc glue to make the transition totally painless. I had
intended to mark the old syscalls as COMPAT6, but the potential to shoot
your feet by building a new kernel without COMPAT_FREEBSD6 but with a
slightly older userland was too great. For now, they have manual
"freebsd6_" prefixes rather than being COMPAT6. They will go back to
being marked 'COMPAT6' after 7-stable starts.

Approved by: re (kensmith)


# f8829a4a 03-Nov-2006 Randall Stewart <rrs@FreeBSD.org>

Ok, here it is, we finally add SCTP to current. Note that this
work is not just mine, but it is also the works of Peter Lei
and Michael Tuexen. They both are my two key other developers
working on the project.. and they need ata-boy's too:
****
peterlei@cisco.com
tuexen@fh-muenster.de
****
I did do a make sysent which updated the
syscall's and sysproto.. I hope that is correct... without
it you don't build since we have new syscalls for SCTP :-0

So go out and look at the NOTES, add
option SCTP (make sure inet and inet6 are present too)
and play with SCTP.

I will see about committing some test tools I have after I
figure out where I should place them. I also have a
lib (libsctp.a) that adds some of the missing socketapi
functions that I need to put into lib's.. I will talk
to George about this :-)

There may still be some 64 bit issues in here, none of
us have a 64 bit processor to test with yet.. Michael
may have a MAC but that's another beast too..

If you have a mac and want to use SCTP contact Michael
he maintains a web site with a loadable module with
this code :-)

Reviewed by: gnn
Approved by: gnn


# 5f641fc0 16-Oct-2006 David Xu <davidxu@FreeBSD.org>

o Add keyword volatile for user mutex owner field.
o Fix type consistency problem by using type long for old
umtx and wait channel.
o Rename casuptr to casuword.


# 888db9e1 03-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Audit creat() system call (compat code), and change type for getpagesize(),
which isn't actually being audited anyway.

MFC after: 3 days
Obtained from: TrustedBSD Project


# 73fa3e5b 20-Sep-2006 David Xu <davidxu@FreeBSD.org>

Replace system call thr_getscheduler, thr_setscheduler, thr_setschedparam
with rtprio_thread, while rtprio system call is for process only, the new
system call rtprio_thread is responsible for LWP.


# 6c2d307a 17-Sep-2006 Robert Watson <rwatson@FreeBSD.org>

AUE_SIGALTSTACK instead of AUE_SIGPENDING for sigaltstack().

Obtained from: TrustedBSD Project
MFC after: 3 days


# 7f26ddda 03-Sep-2006 Robert Watson <rwatson@FreeBSD.org>

Assign proper audit event identifiers to a number of system calls not
covered in previous passes:

- sysarch, rtprio
- clock_settime
- preadv/pwritev
- __getcwd
- kqueue
- fhstatfs
- kldunloadf

Obtained from: TrustedBSD Project


# d1967c5d 03-Sep-2006 Robert Watson <rwatson@FreeBSD.org>

Use AUE_NTP_ADJTIME for ntp_adjtime() instead of AUE_ADJTIME.

Obtained from: TrustedBSD Project


# d10183d9 27-Aug-2006 David Xu <davidxu@FreeBSD.org>

This is the initial version of POSIX priority mutex support. A new userland
mutex structure is added as follows:
	struct umutex {
		__lwpid_t	m_owner;
		uint32_t	m_flags;
		uint32_t	m_ceilings[2];
		uint32_t	m_spare[4];
	};
m_owner represents the owner thread and holds a thread id. In the
non-contested case, userland can simply use atomic_cmpset_int to lock the
mutex; if the mutex is contested, the high order bit will be set, and
userland should do locking and unlocking via a kernel syscall. Flag
UMUTEX_PRIO_INHERIT represents pthread's PTHREAD_PRIO_INHERIT mutex: when
contention happens, the kernel should do priority propagation. Flag
UMUTEX_PRIO_PROTECT indicates pthread's PTHREAD_PRIO_PROTECT mutex:
userland should initialize m_owner to the contested state
UMUTEX_CONTESTED, so that atomic_cmpset_int fails and a kernel syscall is
invoked to do the locking. This is because, for such a mutex, the kernel
should always boost the thread's priority before it can lock the mutex.
m_ceilings is used by PTHREAD_PRIO_PROTECT mutexes: the first element is
used to boost the thread's priority when it locks the mutex, and the second
element is used when the mutex is unlocked. The PTHREAD_PRIO_PROTECT
mutexes' linked list is kept in userland, and m_ceilings[1] is managed by
the thread library so the kernel needn't allocate memory to keep the linked
list. When such a mutex is unlocked, the kernel resets m_owner to
UMUTEX_CONTESTED.
Flag USYNC_PROCESS_SHARED indicates whether the synchronization object is
process shared; if the flag is not set, a vm_map_lookup() call is saved.
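
The uncontested fast path described above can be sketched with C11 atomics
standing in for atomic_cmpset_int. This is an illustrative userland model
only: struct umutex_sketch, the two helper functions, the use of uint32_t
for m_owner and the numeric values of UMUTEX_UNOWNED and UMUTEX_CONTESTED
are assumptions, not the committed definitions.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define UMUTEX_UNOWNED		0x0u		/* assumed value */
#define UMUTEX_CONTESTED	0x80000000u	/* high order bit, per the text */

struct umutex_sketch {
	_Atomic uint32_t	m_owner;	/* owner thread id, or unowned */
	uint32_t		m_flags;
	uint32_t		m_ceilings[2];
	uint32_t		m_spare[4];
};

/* Uncontested lock: succeeds only if m_owner goes from unowned to tid. */
static bool
umutex_trylock_fast(struct umutex_sketch *m, uint32_t tid)
{
	uint32_t expected = UMUTEX_UNOWNED;

	return (atomic_compare_exchange_strong(&m->m_owner, &expected, tid));
}

/* Uncontested unlock: fails if the contested bit was set in the meantime. */
static bool
umutex_unlock_fast(struct umutex_sketch *m, uint32_t tid)
{
	uint32_t expected = tid;

	return (atomic_compare_exchange_strong(&m->m_owner, &expected,
	    UMUTEX_UNOWNED));
}
```

When either helper returns false, a real caller would fall back to the
kernel. Initializing m_owner to UMUTEX_CONTESTED, as the text describes for
priority-protect mutexes, makes the fast lock fail every time, which forces
that fallback.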

The umtx chain is still used as a sleep queue, when a thread is blocked on
PTHREAD_PRIO_INHERIT mutex, a umtx_pi is allocated to support priority
propagating, it is dynamically allocated and reference count is used,
it is not optimized but works well in my tests, while the umtx chain has
its own locking protocol, the priority propagating protocol are all protected
by sched_lock because priority propagating function is called with sched_lock
held from scheduler.

No visible performance degradation is found with these changes. Some parameter
names in _umtx_op syscall are renamed.


# bad9a7a5 16-Aug-2006 Peter Wemm <peter@FreeBSD.org>

Grab two syscall numbers. One is used to emulate functionality that linux
has in its procfs (do a readlink of /proc/self/fd/<nn> to find the pathname
that corresponds to a given file descriptor). Valgrind-3.x needs this
functionality. This is a placeholder only at this time.


# 589201fd 15-Aug-2006 John Baldwin <jhb@FreeBSD.org>

- Use NOSTD rather than NOIMPL for nfssvc() to match other syscalls
provided via klds.
- Correct audit identifier for nfssvc().


# af5bf122 28-Jul-2006 John Baldwin <jhb@FreeBSD.org>

Now that all system calls are MPSAFE, retire the SYF_MPSAFE flag used to
mark system calls as being MPSAFE:
- Stop conditionally acquiring Giant around system call invocations.
- Remove all of the 'M' prefixes from the master system call files.
- Remove support for the 'M' prefix from the script that generates the
syscall-related files from the master system call files.
- Don't explicitly set SYF_MPSAFE when registering nfssvc.


# e0b4add8 28-Jul-2006 John Baldwin <jhb@FreeBSD.org>

Various fixes to comments in the syscall master files including removing
cruft from the audit import and adding mention of COMPAT4 to freebsd32.


# 60088160 13-Jul-2006 David Xu <davidxu@FreeBSD.org>

Add syscalls thr_setscheduler, thr_getscheduler, and thr_setschedparam.
These syscalls are designed to set a thread's scheduling parameters and
policy. Because each syscall contains a size parameter, it is possible
to support future scheduling options, e.g. SCHED_SPORADIC; this option
needs other fields in structure sched_param, which are currently not
available.


# be5747d5 11-Jul-2006 John Baldwin <jhb@FreeBSD.org>

- Add conditional VFS Giant locking to getdents_common() (linux ABIs),
ibcs2_getdents(), ibcs2_read(), ogetdirentries(), svr4_sys_getdents(),
and svr4_sys_getdents64() similar to that in getdirentries().
- Mark ibcs2_getdents(), ibcs2_read(), linux_getdents(), linux_getdents64(),
linux_readdir(), ogetdirentries(), svr4_sys_getdents(), and
svr4_sys_getdents64() MPSAFE.


# bbe5d031 05-Jul-2006 Wayne Salamon <wsalamon@FreeBSD.org>

Add audit events for the extended attribute system calls.

Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)


# 597d608f 27-Jun-2006 John Baldwin <jhb@FreeBSD.org>

- Expand the scope of Giant some in mount(2) to protect the vfsp structure
from going away. mount(2) is now MPSAFE.
- Expand the scope of Giant some in unmount(2) to protect the mp structure
(or rather, to handle concurrent unmount races) from going away.
umount(2) is now MPSAFE, as well as linux_umount() and linux_oldumount().
- nmount(2) and linux_mount() were already MPSAFE.


# 867c089b 28-Mar-2006 Dag-Erling Smørgrav <des@FreeBSD.org>

Revert previous commit at davidxu's insistence. Instead, use __DECONST
(argh!) and rearrange the prototypes to make it clear that _umtx_op()
is not deprecated.


# b3efbabe 28-Mar-2006 Dag-Erling Smørgrav <des@FreeBSD.org>

The undocumented and deprecated system call _umtx_op() takes two pointer
arguments. The first one is never used (all callers pass in 0); the
second is sometimes used to pass in a struct timespec * which is used as
a timeout and never modified. Constify that argument so callers can pass
a const struct timespec * without jumping through hoops.


# 99eee864 23-Mar-2006 David Xu <davidxu@FreeBSD.org>

Implement aio_fsync() syscall.


# 61d3a4ef 28-Feb-2006 David Xu <davidxu@FreeBSD.org>

Let the kernel POSIX timer code and mqueue code use an integer as a resource
handle; the timer_t and mqd_t types will be pointers which userland
will define.


# c983324e 05-Feb-2006 Robert Watson <rwatson@FreeBSD.org>

Prefer AUE_FOO audit identifiers to AUE_O_FOO, which are largely left
over from the Darwin implementation.

When we implement a system call as a wrapper to sysctl(), audit it as
AUE_SYSCTL. This leads to greater compatibility with Solaris audit
trails as sysctl() argument tokens are not the same as the ones for
the original system calls (i.e., setdomainname()).

Replace references to AUE_ events that are equivalent to AUE_NULL with
AUE_NULL. In the case of process signal configuration, this is
because these events do not require auditing.

Move from the Darwin spelling of getsockopt() to the FreeBSD/Solaris
one.

Audit nmount().

Obtained from: TrustedBSD Project


# 9e7d7224 04-Feb-2006 David Xu <davidxu@FreeBSD.org>

Implement thr_set_name to set a name for thread.

Reviewed by: julian


# 62646c07 03-Feb-2006 Robert Watson <rwatson@FreeBSD.org>

Assign audit event identifiers to many system calls.

Much work by: wsalamon
Obtained from: TrustedBSD Project


# 35d29f50 01-Feb-2006 Robert Watson <rwatson@FreeBSD.org>

Map audit-related system calls to audit event identifiers.

Much work by: wsalamon
Obtained from: TrustedBSD Project


# 1ce91824 21-Jan-2006 David Xu <davidxu@FreeBSD.org>

Make aio code MP safe.


# 5a56b437 23-Dec-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Add abort2() systemcall.


# 94e1294b 26-Nov-2005 David Xu <davidxu@FreeBSD.org>

Don't use OpenBSD syscall numbers, instead, use new syscall numbers
for POSIX message queue.

Suggested by: rwatson


# 655291f2 25-Nov-2005 David Xu <davidxu@FreeBSD.org>

Bring in experimental kernel support for POSIX message queue.


# 0972628a 29-Oct-2005 David Xu <davidxu@FreeBSD.org>

Fix sigevent's POSIX incompatible problem by adding member fields
sigev_notify_function and sigev_notify_attributes. AIO syscalls
use sigevent, so they have to be adjusted.

Reviewed by: alc


# 86857b36 22-Oct-2005 David Xu <davidxu@FreeBSD.org>

Implement POSIX timers. Currently only the CLOCK_REALTIME and CLOCK_MONOTONIC
clocks are supported. I plan to merge the XSI timer ITIMER_REAL and the other
two CPU timers into the new code; currently three slots are available for
the XSI timers.
The SIGEV_THREAD notification type is not supported yet because our
sigevent struct lacks two member fields:
sigev_notify_function
sigev_notify_attributes
I have found that sigevent is used in AIO, so I won't add the two members
unless the AIO code is adjusted.


# d60e86c8 18-Oct-2005 Stefan Farfeleder <stefanf@FreeBSD.org>

Const-qualify ksem_timedwait's parameter abstime as it's only passed in.


# 9104847f 13-Oct-2005 David Xu <davidxu@FreeBSD.org>

1. Change the prototypes of trapsignal and sendsig to use ksiginfo_t *. Most
changes in MD code are trivial. Before this change, trapsignal and
sendsig used discrete parameters; now they use member fields of the
ksiginfo_t structure. For sendsig, this change allows us to pass a
POSIX realtime signal value to user code.

2. Remove cpu_thread_siginfo, it is no longer needed because we now always
generate ksiginfo_t data and feed it to libpthread.

3. Add p_sigqueue to proc structure to hold shared signals which were
blocked by all threads in the proc.

4. Add td_sigqueue to thread structure to hold all signals delivered to
thread.

5. i386 and amd64 now return POSIX standard si_code, other arches will
be fixed.

6. In this sigqueue implementation, the pending signal set is kept as before,
and an extra siginfo list holds additional siginfo_t data for signals.
Kernel code that uses psignal() still behaves as before: it will not fail
even under memory pressure. The only exception is when deleting a signal:
we should call sigqueue_delete to remove the signal from the sigqueue,
not SIGDELSET. Currently no kernel code delivers a signal
with additional data, so the kernel should be as stable as before.
A ksiginfo can carry more information, for example, allowing a signal to
be delivered but its siginfo data thrown away if there is not enough memory.
SIGKILL and SIGSTOP have a fast path in sigqueue_add, because they cannot
be caught or masked.
The sigqueue() syscall allows user code to queue a signal to a target
process; if resources are unavailable, EAGAIN will be returned as
the specification says.
Just before a thread exits, signal queue memory will be freed by
sigqueue_flush.
Currently, all signals are allowed to be queued, not only realtime signals.

Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64
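
The sigqueue() behaviour described in the entry above can be exercised from userland. The following is a minimal sketch, assuming a single-threaded process and a POSIX.1-2008 environment; the function name queue_and_receive is made up for illustration and is not part of the commit.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Queue SIGUSR1 to the calling process with an integer payload, then
 * collect it synchronously. Returns the payload, or -1 on any failure.
 * (Hypothetical helper, written for this example only.)
 */
int queue_and_receive(int payload)
{
    sigset_t set, old;
    siginfo_t info;
    union sigval value;
    int result = -1;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    /* Block the signal so it stays queued until sigwaitinfo() takes it. */
    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
        return -1;

    value.sival_int = payload;
    if (sigqueue(getpid(), SIGUSR1, value) == 0 &&
        sigwaitinfo(&set, &info) == SIGUSR1 &&
        info.si_code == SI_QUEUE)
        result = info.si_value.sival_int;

    /* Restore the caller's signal mask. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```

A caller would check for -1 and, if sigqueue() failed with EAGAIN, retry later, since that is the error the entry says is returned when queue resources are unavailable.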


# 7f300b47 27-Sep-2005 Christian S.J. Peron <csjp@FreeBSD.org>

Mark the extended attribute syscalls as being MP safe.

Requested by: jhb


# 4acd2e73 08-Jul-2005 John Baldwin <jhb@FreeBSD.org>

Mark second instance of lchown() MP safe just like the first.

Approved by: re (scottl)


# bcd9e0dd 07-Jul-2005 John Baldwin <jhb@FreeBSD.org>

- Add two new system calls: preadv() and pwritev() which are like readv()
and writev() except that they take an additional offset argument and do
not change the current file position. In SAT speak:
preadv:readv::pread:read and pwritev:writev::pwrite:write.
- Try to reduce code duplication some by merging most of the old
kern_foov() and dofilefoo() functions into new dofilefoo() functions
that are called by kern_foov() and kern_pfoov(). The non-v functions
now all generate a simple uio on the stack from the passed in arguments
and then call kern_foov(). For example, read() now just builds a uio and
calls kern_readv() and pwrite() just builds a uio and calls kern_pwritev().

PR: kern/80362
Submitted by: Marc Olzheim marcolz at stack dot nl (1)
Approved by: re (scottl)
MFC after: 1 week
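
As a userland illustration of the preadv()/pwritev() semantics described above (an explicit offset, and no change to the current file position), here is a minimal sketch. The helper name positional_roundtrip and the /tmp path template are assumptions made for the example.

```c
#define _DEFAULT_SOURCE   /* glibc hides preadv()/pwritev() otherwise */
#include <sys/types.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write "hello" and "world" at byte offset 4 with pwritev(), read them
 * back with preadv(), and report the file position afterwards.
 * Returns the position (0 if neither call moved it), or -1 on failure.
 * (Hypothetical helper, written for this example only.)
 */
long positional_roundtrip(void)
{
    char path[] = "/tmp/pvec-XXXXXX";   /* assumed writable directory */
    char a[5], b[5];
    struct iovec out[2] = {
        { .iov_base = "hello", .iov_len = 5 },
        { .iov_base = "world", .iov_len = 5 },
    };
    struct iovec in[2] = {
        { .iov_base = a, .iov_len = sizeof(a) },
        { .iov_base = b, .iov_len = sizeof(b) },
    };
    long pos = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                        /* file vanishes on close */

    /* Both calls take the offset explicitly, like pread()/pwrite(). */
    if (pwritev(fd, out, 2, 4) == 10 &&
        preadv(fd, in, 2, 4) == 10 &&
        memcmp(a, "hello", 5) == 0 && memcmp(b, "world", 5) == 0)
        pos = (long)lseek(fd, 0, SEEK_CUR);

    close(fd);
    return pos;
}
```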


# f3596e33 30-May-2005 Robert Watson <rwatson@FreeBSD.org>

Introduce a new field in the syscalls.master file format to hold the
audit event identifier associated with each system call, which will
be stored by makesyscalls.sh in the sy_auevent field of struct sysent.
For now, default the audit identifier on all system calls to AUE_NULL,
but in the near future, other BSM event identifiers will be used. The
mapping of system calls to event identifiers is many:one due to
multiple system calls that map to the same end functionality across
compatibility wrappers, ABI wrappers, etc.

Submitted by: wsalamon
Obtained from: TrustedBSD Project


# 45cb0a00 29-May-2005 Robert Watson <rwatson@FreeBSD.org>

Normalize white space in syscalls.master: try to use tabs before system
call types.


# d85bfefd 28-May-2005 Robert Watson <rwatson@FreeBSD.org>

Mark ntp_gettime() as MSTD, since its system call path will acquire
Giant if required.


# d7b9187b 28-May-2005 Robert Watson <rwatson@FreeBSD.org>

Mark the following compatibility system calls as MCOMPAT or MCOMPAT4 based
on their simply wrapping MPSAFE implementations of existing MPSAFE
system calls:

getfsstat()
lseek()
stat()
lstat()
truncate()
ftruncate()
statfs()
fstatfs()

Note that ogetdirentries() is not marked MPSAFE because it does not share
the MPSAFE implementation used for getdirentries(), and requires separate
locking to be implemented.


# 160349ad 28-May-2005 Robert Watson <rwatson@FreeBSD.org>

Mark quotactl() as MSTD.


# ec792a67 28-May-2005 Robert Watson <rwatson@FreeBSD.org>

Mark kenv(2) as MPSAFE, since it appears to be properly locked down.


# 5267dc0b 28-May-2005 Robert Watson <rwatson@FreeBSD.org>

Also mark the COMPAT4 version of fhstatfs() as MPSAFE.


# 2191a5d1 27-May-2005 Robert Watson <rwatson@FreeBSD.org>

Mark fhopen(), fhstat(), and fhstatfs() as MSTD, since they now
acquire Giant themselves.


# c4bd610f 22-Apr-2005 David Xu <davidxu@FreeBSD.org>

Add a new syscall, thr_new, to create a thread atomically: it will
inherit the signal mask from the parent thread and set up TLS, the stack,
and the user entry address.
Also support POSIX threads' PTHREAD_SCOPE_PROCESS and PTHREAD_SCOPE_SYSTEM;
a sysctl is also provided to control the scheduler scope.


# b2624444 09-Mar-2005 Stefan Farfeleder <stefanf@FreeBSD.org>

Fix typo in comment.


# 96d31285 01-Mar-2005 Paul Saab <ps@FreeBSD.org>

Change the prototype of kevent to remove the const from the changelist.

Reviewed by: jhb


# 810ad5ec 25-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Struct mount is not yet locked well enough to allow
mount/nmount/unmount to run without Giant. Mark them as STD here.


# 29ed48fc 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Change all VFS syscalls to MSTD as they all manually deal with giant
or the appropriate filesystem locks.

Sponsored By: Isilon Systems, Inc.


# fe0ef598 02-Jan-2005 Marcel Moolenaar <marcel@FreeBSD.org>

uuidgen(2) is MP safe.


# c180db2b 25-Dec-2004 David Xu <davidxu@FreeBSD.org>

Make _umtx_op() a more general interface: the final parameter need not be
a timespec pointer; every parameter is interpreted according to the opcode.


# 50586e8b 17-Dec-2004 David Xu <davidxu@FreeBSD.org>

1. Make umtx sharable between processes: two or more processes
call mmap() to create a shared space and then initialize a umtx in it;
after that, threads in different processes can use the umtx just
as threads in the same process do.
2. Introduce a new syscall, _umtx_op, to support timed lock and condition
variable semantics. Also, the original umtx_lock and umtx_unlock inline
functions are now reimplemented using _umtx_op; _umtx_op can
use an arbitrary id, not just a thread id.


# 7fa77ace 24-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Mark mount, unmount and nmount MPSAFE


# 6b270b48 18-Nov-2004 Mark Santcroos <marks@FreeBSD.org>

Add ntp_gettime(2) system call.

Reviewed by: imp, phk, njl, peter
Approved by: njl


# 3e8c2449 23-Oct-2004 Robert Watson <rwatson@FreeBSD.org>

Add system call place-holders for the following system calls
implementing Sun's BSM Audit API on FreeBSD:

audit()
auditon()
getauid()
setauid()
getaudit()
setaudit()
getaudit_addr()
setaudit_addr()
auditctl()

Submitted by: Wayne Salamon <wsalamon at computer dot org>
Obtained from: TrustedBSD Project


# ebfcca3d 06-Oct-2004 David Xu <davidxu@FreeBSD.org>

Regen to unbreak world.

Pointy hat to: mtm


# 1a946b9f 13-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add kldunloadf() system call. Stay tuned for following commit messages.


# 507b0318 12-Jul-2004 David Xu <davidxu@FreeBSD.org>

Change kse_switchin to accept a kse_thr_mailbox pointer; the syscall
will be used heavily in debugging KSE threads. This breaks libpthread
on IA64, but because libpthread was not in the 5.2.1 release, I would like
to change it so we needn't introduce another syscall.


# cd28f17d 01-Jul-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Change the thread ID (thr_id_t) used for 1:1 threading from being a
pointer to the corresponding struct thread to the thread ID (lwpid_t)
assigned to that thread. The primary reason for this change is that
libthr now internally uses the same ID as the debugger and the kernel
when referring to a kernel thread. This allows us to implement the
support for debugging without additional translations and/or mappings.

To preserve the ABI, the 1:1 threading syscalls, including the umtx
locking API have not been changed to work on a lwpid_t. Instead the
1:1 threading syscalls operate on long and the umtx locking API has
not been changed except for the contested bit. Previously this was
the least significant bit. Now it's the most significant bit. Since
the contested bit should not be tested by userland, this change is
not expected to be visible. Just to be sure, UMTX_CONTESTED has been
removed from <sys/umtx.h>.

Reviewed by: mtm@
ABI preservation tested on: i386, ia64
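
The move of the contested bit can be pictured with a small sketch. The macro and function names below are hypothetical and are not the kernel's definitions. The example also rests on an inference the entry does not state: an aligned struct thread pointer presumably always had a free low bit, whereas a numeric lwpid_t can be odd.

```c
#include <limits.h>

/* Hypothetical masks for illustration; not the kernel's definitions. */
#define CONTESTED_LSB 1L        /* old scheme: least significant bit */
#define CONTESTED_MSB LONG_MIN  /* new scheme: most significant bit  */

/* Recover the owner ID from a lock word under the new scheme. */
long owner_of(long word)
{
    return word & ~CONTESTED_MSB;
}

/* Nonzero if a new-scheme lock word has the contested bit set. */
int is_contested(long word)
{
    return (word & CONTESTED_MSB) != 0;
}

/* Nonzero if the old LSB flag would overlap a bit of this owner ID. */
int lsb_collides(long id)
{
    return (id & CONTESTED_LSB) != 0;
}
```

Under these assumptions an odd thread ID such as 100003 collides with the old flag but survives a round trip through the new one.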


# 2ed57081 21-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Mark unlink() as MPSAFE as we now acquire Giant in the unlink()
system call.


# 61d87ffd 21-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Mark link() system call as MPSAFE.


# 0b0a60fb 05-Apr-2004 Doug Rabson <dfr@FreeBSD.org>

Add lgetfh(2) which is like getfh(2) but doesn't follow symlinks.


# 1713a516 27-Mar-2004 Mike Makonnen <mtm@FreeBSD.org>

Separate thread synchronization from signals in libthr. Instead
use msleep() and wakeup_one().

Discussed with: jhb, peter, tjr


# 1f325ae3 16-Mar-2004 David Malone <dwmalone@FreeBSD.org>

Get ready to mark open, creat and nosys as MPSAFE.


# 8ac61436 15-Mar-2004 John Baldwin <jhb@FreeBSD.org>

Drop the proc lock around calls to the MD functions ptrace_single_step(),
ptrace_set_pc(), and cpu_ptrace() so that those functions are free to
acquire Giant, sleep, etc. We already do a PHOLD/PRELE around them so
that it is safe to sleep inside of these routines if necessary. This
allows ptrace() to be marked MP safe again as it no longer triggers lock
order reversals on Alpha.

Tested by: wilko


# 37814395 13-Mar-2004 Peter Wemm <peter@FreeBSD.org>

Push Giant down a little further:
- no longer serialize on Giant for thread_single*() and family in fork,
exit and exec
- thread_wait() is mpsafe, assert no Giant
- reduce scope of Giant in exit to not cover thread_wait and just do
vm_waitproc().
- assert that thread_single() family are not called with Giant
- remove the DROP/PICKUP_GIANT macros from thread_single() family
- assert that thread_suspend_check() is not called with Giant
- remove manual drop_giant hack in thread_suspend_check since we know it
isn't held.
- remove the DROP/PICKUP_GIANT macros from thread_suspend_check() family
- mark kse_create() mpsafe


# aae94fbb 02-Feb-2004 Daniel Eischen <deischen@FreeBSD.org>

Add ksem_timedwait() to complement ksem_wait().

Glanced at by: alfred


# 866e3b7e 25-Dec-2003 Alfred Perlstein <alfred@FreeBSD.org>

Put restrict back in, the compilation failure was my fault when I
did a bad merge from the PR.

Thanks to Bruce Evans for explaining.


# 6502da13 24-Dec-2003 Alfred Perlstein <alfred@FreeBSD.org>

We're not ready for restrict qualifiers here.


# 9f144cff 24-Dec-2003 Alfred Perlstein <alfred@FreeBSD.org>

Add restrict qualifiers.

PR: 44394
Submitted by: Craig Rodrigues <rodrige@attbi.com>


# eec525a4 22-Dec-2003 Peter Wemm <peter@FreeBSD.org>

Remove namespc column and attempt to un-fold some of the longer lines
that now fit.


# 5352eb6b 10-Dec-2003 Peter Wemm <peter@FreeBSD.org>

Update file locations for syscall tables to copy to.


# 702b2a17 07-Dec-2003 Marcel Moolenaar <marcel@FreeBSD.org>

Add kse_switchin(2). This syscall can be used by KSE implementations
to have the kernel switch to a new thread, instead of doing it in
userland. It is in fact needed on ia64 where syscall restarts do not
return to userland first. It's completely handled inside the kernel.
As such, any context created by the kernel as part of an upcall and
caused by some syscall needs to be restored by the kernel.


# 5c49a056 13-Nov-2003 Jeff Roberson <jeff@FreeBSD.org>

- Revision 1.156 marked ptrace() SMP safe. Unfortunately, alpha implements
parts of ptrace using proc_rwmem(). proc_rwmem() requires giant, and
giant must be acquired prior to the proc lock, so ptrace must require giant
still.


# fde81c7d 12-Nov-2003 Kirk McKusick <mckusick@FreeBSD.org>

Update the statfs structure with 64-bit fields to allow
accurate reporting of multi-terabyte filesystem sizes.

You should build and boot a new kernel BEFORE doing a `make world'
as the new kernel will know about binaries using the old statfs
structure, but an old kernel will not know about the new system
calls that support the new statfs structure. Running an old kernel
after a `make world' will cause programs such as `df' that do a
statfs system call to fail with a bad system call.

Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Tim Robbins <tjr@freebsd.org>
Reviewed by: Julian Elischer <julian@elischer.org>
Reviewed by: the hordes of <arch@freebsd.org>
Sponsored by: DARPA & NAI Labs.
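
The motivation for 64-bit fields can be checked with a little arithmetic. This sketch assumes the old block counter behaved like a 32-bit signed integer (as a long does on i386) and uses a 2 KiB block size purely as an example; neither figure is taken from the commit, and fits_in_counter is a made-up helper.

```c
#include <stdint.h>

/*
 * Report whether a filesystem of total_bytes, counted in blocks of
 * block_size bytes, fits in a signed counter that is `bits` wide
 * (1 <= bits <= 64). Hypothetical helper for illustration.
 */
int fits_in_counter(uint64_t total_bytes, uint64_t block_size, unsigned bits)
{
    uint64_t blocks = total_bytes / block_size;
    /* Largest positive value of a signed integer with `bits` bits. */
    uint64_t max_blocks = (UINT64_C(1) << (bits - 1)) - 1;

    return blocks <= max_blocks;
}
```

With 2 KiB blocks, a 1 TiB filesystem needs 2^29 blocks and fits in 32 bits, while an 8 TiB filesystem needs 2^32 blocks and does not; both fit comfortably in 64 bits.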


# c055e5d4 07-Nov-2003 John Baldwin <jhb@FreeBSD.org>

Mark ptrace(), ktrace(), utrace(), sysarch(), and issetugid() as MP safe.
The parts of these calls that are not yet MP safe acquire Giant explicitly.


# bd781a1e 21-Oct-2003 Scott Long <scottl@FreeBSD.org>

Don peril-sensitive sunglasses and mark pipe(2) as MPSAFE. I've beaten up
on it for the last 15 hours with no signs of problems. It gives a small
(1%) gain on buildworld since pipe_read/pipe_write are already free of Giant.


# 111b0d0d 20-Oct-2003 David Malone <dwmalone@FreeBSD.org>

Mark dup as MPSAFE. Giant was pushed into dup ages ago, but it looks
like it was missed in syscalls.master.

Spotted by: alc


# ffe5125e 06-Sep-2003 Alan Cox <alc@FreeBSD.org>

msync(2) should be declared MP-safe.


# dd7da9aa 17-Jul-2003 David Xu <davidxu@FreeBSD.org>

o Refine kse_thr_interrupt to allow it to handle different commands.
o Remove TDF_NOSIGPOST.
o Add a member td_waitset to proc structure, it will be used for sigwait.

Tested by: deischen


# 9dde3bc9 28-Jun-2003 David Xu <davidxu@FreeBSD.org>

o Change kse_thr_interrupt to allow sending a signal to a specified thread
or unblocking a thread in the kernel, and allow the UTS to specify whether
the syscall should be restarted.
o Add the ability for the UTS to monitor signals arriving at and being
removed from the process; the flag PS_SIGEVENT is used to indicate the events.
o Add a KMF_WAITSIGEVENT flag for the KSE mailbox; the UTS calls kse_release
with this flag set to wait for the above signal event.
o For SA-based threads, the kernel masks all signals in its signal mask and
lets the UTS use kse_thr_interrupt to interrupt a thread and install a signal
frame in userland for the thread.
o Add tm_syncsig to the thread mailbox; when a hardware trap occurs,
it is used to deliver a synchronous signal to userland, and an upcall
is scheduled so the UTS can process the synchronous signal for the thread.

Reviewed by: julian (mentor)


# 9e18f277 03-Jun-2003 Robert Watson <rwatson@FreeBSD.org>

Add system calls to explicitly list extended attributes on a
file/directory/link, rather than using a less explicit hack on
the extattr retrieval API:

extattr_list_fd()
extattr_list_file()
extattr_list_link()

The existing API was counter-intuitive, and poorly documented.
The prototypes for these system calls are identical to
extattr_get_*(), but without a specific attribute name to
leave NULL.

Pointed out by: Dominic Giampaolo <dbg@apple.com>
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# fd7a8150 08-Apr-2003 Mike Barcroft <mike@FreeBSD.org>

o In struct prison, add an allprison linked list of prisons (protected
by allprison_mtx), a unique prison/jail identifier field, two path
fields (pr_path for reporting and pr_root vnode instance) to store
the chroot() point of each jail.
o Add jail_attach(2) to allow a process to bind to an existing jail.
o Add change_root() to perform the chroot operation on a specified
vnode.
o Generalize change_dir() to accept a vnode, and move namei() calls
to callers of change_dir().
o Add a new sysctl (security.jail.list) which is a group of
struct xprison instances that represent a snapshot of active jails.

Reviewed by: rwatson, tjr


# f27bf63b 31-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Mark the various thr syscalls as MP safe. Previously there was a bug if
this was not done since thr_exit() unwinds giant.


# 6eeb9653 31-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Include umtx.h in files generated by makesyscalls.sh
- Add system calls for umtx.


# 8d5377e5 31-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add the four thr related system calls.


# a447cd8b 31-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Define sigwait, sigtimedwait, and sigwaitinfo in terms of
kern_sigtimedwait() which is capable of supporting all of their semantics.
- These should be POSIX compliant but more careful review is needed before
we announce this.
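
The idea of expressing the whole sigwait family through one timed primitive can be sketched in userland. This is an analogy only, not the kernel's kern_sigtimedwait() code. The wrapper names are invented, and the sketch relies on a NULL timeout meaning "wait indefinitely", which Linux and (as far as I know) FreeBSD implement but POSIX leaves unspecified.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

/* sigwaitinfo() modelled as the timed primitive with no timeout. */
int my_sigwaitinfo(const sigset_t *set, siginfo_t *info)
{
    return sigtimedwait(set, info, NULL);
}

/*
 * sigwait() modelled on the same primitive: the signal number goes
 * out through *sig and failures are returned as an error number
 * rather than through errno. Interrupted waits are retried.
 */
int my_sigwait(const sigset_t *set, int *sig)
{
    siginfo_t info;
    int signo;

    do {
        signo = sigtimedwait(set, &info, NULL);
    } while (signo < 0 && errno == EINTR);

    if (signo < 0)
        return errno;
    *sig = signo;
    return 0;
}
```

Callers are expected to block the signals in the set first, as with the real functions.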


# eb117d5c 20-Feb-2003 David Xu <davidxu@FreeBSD.org>

Add a timeout parameter to kse_release.


# b17c9cfa 26-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Add const qualifier to data argument for msgsnd.

PR: standards/45274
Submitted by: Craig Rodrigues <rodrigc@attbi.com>


# e1d7d0bb 25-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Bring shm functions closer to the Open Group standards.

PR: 47469
Submitted by: Craig Rodrigues <rodrigc@attbi.com>


# 3beb3270 25-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Bring semop() closer to the Open Group standards.

PR: 47471
Submitted by: Craig Rodrigues <rodrigc@attbi.com>


# cac3fba0 04-Jan-2003 David Xu <davidxu@FreeBSD.org>

Some KSE syscalls are MPSAFE.


# b1f4acd8 29-Dec-2002 Robert Watson <rwatson@FreeBSD.org>

Add definitions for four new system calls:

__acl_get_link() Retrieve an ACL by name without following
symbolic links.
__acl_set_link() Set an ACL by name without following
symbolic links.
__acl_delete_link() Delete an ACL by name without following
symbolic links.
__acl_aclcheck_link() Check an ACL against a file by name without
following symbolic links.

These calls are similar in spirit to lstat(), lchown(), lchmod(), etc,
and will be used under similar circumstances.

Obtained from: TrustedBSD Project


# 92da00bb 15-Dec-2002 Matthew Dillon <dillon@FreeBSD.org>

This is David Schultz's swapoff code which I am finally able to commit.
This should be considered highly experimental for the moment.

Submitted by: David Schultz <dschultz@uclink.Berkeley.EDU>
MFC after: 3 weeks


# 2be05b70 15-Nov-2002 Daniel Eischen <deischen@FreeBSD.org>

Add getcontext, setcontext, and swapcontext as system calls.
Previously these were libc functions but were requested to
be made into system calls for atomicity and to coalesce what
might be two entrances into the kernel (signal mask setting
and floating point trap) into one.

A few style nits and comments from bde are also included.

Tested on alpha by: gallatin
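
For readers unfamiliar with the interface, here is a minimal userland sketch of getcontext() and setcontext(), assuming a C library that still ships <ucontext.h> (the interface was dropped from POSIX.1-2008, and some libcs omit it). The function name resume_count is invented for the example.

```c
#include <ucontext.h>

/*
 * Save a context, then jump back to it until `limit` passes have been
 * made. Returns the number of passes, or -1 if getcontext() fails.
 * The counter is volatile so its value survives the jump.
 */
int resume_count(int limit)
{
    ucontext_t ctx;
    volatile int passes = 0;

    if (getcontext(&ctx) != 0)
        return -1;

    /* Execution resumes here after every setcontext(&ctx). */
    passes++;
    if (passes < limit)
        setcontext(&ctx);   /* does not return on success */

    return passes;
}
```

Because the saved context also carries the signal mask, restoring it in one kernel entry is the kind of atomicity the entry above refers to.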


# 21bb9ea2 05-Nov-2002 Robert Watson <rwatson@FreeBSD.org>

Flesh out the definition of __mac_execve(): per earlier discussion,
it's essentially execve() with an optional MAC label argument.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 6cedb451 01-Nov-2002 Robert Watson <rwatson@FreeBSD.org>

Rename __execve_mac() to __mac_execve() for increased consistency
with other MAC system calls.

Requested by: various (phk, gordont, jake, ...)


# 23eeeff7 25-Oct-2002 Peter Wemm <peter@FreeBSD.org>

Split 4.x and 5.x signal handling so that we can keep 4.x signal
handling clean and functional as 5.x evolves. This allows some of the
nasty bandaids in the 5.x codepaths to be unwound.

Encapsulate 4.x signal handling under COMPAT_FREEBSD4 (there is an
anti-foot-shooting measure in place, 5.x folks need this for a while) and
finish encapsulating the older stuff under COMPAT_43. Since the ancient
stuff is required on alpha (longjmp(3) passes a 'struct osigcontext *'
to the current sigreturn(2), instead of the 'ucontext_t *' that sigreturn
is supposed to take), add a compile time check to prevent foot shooting
there too. Add uniform COMPAT_43 stubs for ia64/sparc64/powerpc.

Tested on: i386, alpha, ia64. Compiled on sparc64 (a few days ago).
Approved by: re


# aad1cdc8 22-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Flesh out prototypes for __mac_get_pid, __mac_get_link, and
__mac_set_link, based on __mac_get_proc() except with a pid,
and __mac_get_file(), __mac_set_file() except that they do
not follow symlinks. First in a series of commits to flesh
out the user API.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 8556393b 19-Oct-2002 Peter Wemm <peter@FreeBSD.org>

Stake a claim on 418 (__xstat), 419 (__xfstat), 420 (__xlstat)


# c8447553 19-Oct-2002 Peter Wemm <peter@FreeBSD.org>

Grab 416/417 real estate before I get burned while testing again.
This is for the not-quite-ready signal/fpu abi stuff. It may not see
the light of day, but I'm certainly not going to be able to validate it
when getting shot in the foot due to syscall number conflicts.


# bc5245d9 19-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Add a placeholder for the execve_mac() system call, similar to SELinux's
execve_secure() system call, which permits a process to pass in a label
for a label change during exec. This permits SELinux to change the
label for the resulting exec without a race following a manual label
change on the process. Because this interface uses our general purpose
MAC label abstraction, we call it execve_mac(), and wrap our port of
SELinux's execve_secure() around it with appropriate sid mappings.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 803cc8aa 14-Oct-2002 Peter Wemm <peter@FreeBSD.org>

Restore pointer that was removed in 1.128. This wasn't a merge-o.


# 3c4aba09 09-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Fix what looks like a merge-o from a conflict in the last commit to
syscalls.master.


# 0d66d36f 09-Oct-2002 Peter Wemm <peter@FreeBSD.org>

Add a pointer to the alternate syscall tables on 64 bit platforms.


# 8b10835c 09-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Flesh out the extattr_{delete,get,set}_link() system calls: variations
on the _file() theme that do not follow symlinks. Sync to MAC tree.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 36a8dac1 02-Oct-2002 Archie Cobbs <archie@FreeBSD.org>

Let kse_wakeup() take a KSE mailbox pointer argument.

Reviewed by: julian


# 4499985e 30-Sep-2002 Robert Watson <rwatson@FreeBSD.org>

Reserve system call numbers for the following system calls:

__mac_get_pid Retrieve MAC label of a process by pid

Similar to __mac_get_proc() except that the target process of
the operation is explicitly specified rather than assuming
curthread.

__mac_get_link Retrieve MAC label of a path with NOFOLLOW
__mac_set_link Set MAC label of a path with NOFOLLOW
extattr_set_link Set EAs on a path with NOFOLLOW
extattr_get_link Retrieve EAs on a path with NOFOLLOW
extattr_delete_link Delete EAs on a path with NOFOLLOW

These calls are similar to __mac_get_file(), __mac_set_file(),
extattr_set_file(), extattr_get_file(), and extattr_delete_file(),
except that they do not follow symlinks. The distinction between
these calls is similar to lchown() vs chown().

Implementations to follow.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 89def71c 25-Sep-2002 Archie Cobbs <archie@FreeBSD.org>

Make the following name changes to KSE related functions, etc., to better
represent their purpose and minimize namespace conflicts:

kse_fn_t -> kse_func_t
struct thread_mailbox -> struct kse_thr_mailbox
thread_interrupt() -> kse_thr_interrupt()
kse_yield() -> kse_release()
kse_new() -> kse_create()

Add missing declaration of kse_thr_interrupt() to <sys/kse.h>.
Regenerate the various generated syscall files. Minor style fixes.

Reviewed by: julian


# 6d5dec35 18-Sep-2002 Alfred Perlstein <alfred@FreeBSD.org>

Add the rest of the kernel support for the sem_ API in kern/uipc_sem.c.

Option 'P1003_1B_SEMAPHORES' to compile them in, or load the "sem" module
to activate them.

Have kern/makesyscalls.sh emit an include for sys/_semaphore.h into sysproto.h
to pull in the typedef for semid_t.

Add the syscalls to the syscall table as module stubs.


# f61b8549 19-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

mac_syscall is now implemented, switch to MSTD.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 280f0785 06-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Rename mac_policy() to mac_syscall() to be more reflective of its
purpose.

Submitted by: cvance@tislabs.com
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 55fb7830 30-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

Replace 'void *' with 'struct mac *' now that mac.h is in the base
tree. The current POSIX.1e-derived userland MAC interface is
scheduled for replacement, but will act as a functional placeholder
until the replacement is done. These system calls allow userland
processes to get and set labels on the current process as well
as on file system objects and file descriptor backed objects.


# aedbd622 30-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce a mac_policy() system call that will provide MAC policies
with a general purpose front end entry point for user applications
to invoke. The MAC framework will route the system call to the
appropriate policy by name.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 5d37d00a 29-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Prototype function arguments, only with MAC-specific structures
replaced with void until we bring in the actual structure definitions.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 8a32e0c9 13-Jul-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove incorrect comment about now corrected manpage.


# 9c341296 12-Jul-2002 Alfred Perlstein <alfred@FreeBSD.org>

Create a bug-for-bug FreeBSD4 compatible version of sendfile and move the
fixed sendfile over. This is needed to preserve binary compatibility from
4.x to 5.x.


# e602ba25 29-Jun-2002 Julian Elischer <julian@FreeBSD.org>

Part 1 of KSE-III

The ability to schedule multiple threads per process
(on one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 65772a1a 13-Jun-2002 Robert Watson <rwatson@FreeBSD.org>

Keep POSIX.1e capabilities system call placeholders, but remove definitions.


# 494eefd8 27-May-2002 Marcel Moolenaar <marcel@FreeBSD.org>

Add syscall uuidgen() for generating Universally Unique Identifiers
(UUIDs). On ia64 UUIDs, aka GUIDs, are used by EFI and the firmware
among others. To create GUID Partition Tables (GPTs), we need to
be able to generate UUIDs.


# 8d9b781f 05-May-2002 Maxime Henrion <mux@FreeBSD.org>

Add an entry for the lchflags(2) syscall. It's useful to prevent
a symlink deletion.

Reviewed by: rwatson


# fd448168 17-Apr-2002 Maxime Henrion <mux@FreeBSD.org>

Add an entry for the kenv(2) syscall (code to follow).

Reviewed by: peter


# b0d97980 13-Apr-2002 Alan Cox <alc@FreeBSD.org>

Remove the requirement that Giant be held around sigreturn().


# a0805f6f 11-Apr-2002 Alan Cox <alc@FreeBSD.org>

Remove the requirement that Giant be held around osigreturn(). All platform-
specific implementations are MPSAFE.


# 11ffd032 05-Mar-2002 Robert Watson <rwatson@FreeBSD.org>

Reserve system call numbers for the MAC framework. This will prevent
people working on the MAC tree from getting toasted whenever system call
numbers are allocated in the main tree (for example, for KSE :-).
Calls allocated: __mac_{get,set}_proc, __mac_{get,set}_{fd,file}().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# c28841c1 18-Feb-2002 Julian Elischer <julian@FreeBSD.org>

Add stub syscalls and definitions for KSE calls.
"Book'em Danno"


# 8a2c87e7 18-Feb-2002 Julian Elischer <julian@FreeBSD.org>

Add 5 KSE syscalls. Two will be implemented with the next KSE
step and the others are reservations for coming code.
All will be stubbed in this kernel in the next commit.
This will allow people to easily make KSE binaries for userland testing
(the syscalls will be in libc) but they will still need a real KSE kernel
to test it. (libc looks in /sys to decide what it should add stubs for).


# bc874287 17-Feb-2002 Daniel Eischen <deischen@FreeBSD.org>

Fix prototype to sigreturn to use struct __ucontext instead of ucontext_t.


# 74237f55 09-Feb-2002 Robert Watson <rwatson@FreeBSD.org>

Part I: Update extended attribute API and ABI:

o Modify the system call syntax for extattr_{get,set}_{fd,file}() so
as not to use the scatter gather API (which appeared not to be used
by any consumers, and be less portable), rather, accepts 'data'
and 'nbytes' in the style of other simple read/write interfaces.
This changes the API and ABI.

o Modify system call semantics so that extattr_get_{fd,file}() return
a size_t. When performing a read, the number of bytes read will
be returned, unless the data pointer is NULL, in which case the
number of bytes of data is returned. This changes the API only.

o Modify the VOP_GETEXTATTR() vnode operation to accept a size_t *
argument so as to return the size, if desirable. If set to NULL,
the size will not be returned.

o Update various filesystems (pseudofs, ufs) to DTRT.

These changes should make extended attributes more useful and more
portable. More commits to rebuild the system call files, as well
as update userland utilities to follow.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 860965f1 01-Feb-2002 Bruce Evans <bde@FreeBSD.org>

Made osigreturn(2) standard so that SYS_osigreturn can be used in the
signal trampoline for old signals. The arches that support old signals
currently abuse sigreturn(2) instead. This mainly complicates things
and slightly breaks the new sigreturn(2).

COMPAT is too limited to support the correct configuration of osigreturn,
and this commit doesn't attempt to fix it; it just moves the bogusness:
osigreturn() must now be provided unconditionally even on arches that
don't really need it; previously it had to be provided under the bogus
condition defined(COMPAT_43).


# 21d56e9c 29-Dec-2001 Alfred Perlstein <alfred@FreeBSD.org>

Make AIO a loadable module.

Remove the explicit call to aio_proc_rundown() from exit1(), instead AIO
will use at_exit(9).

Add functions at_exec(9), rm_at_exec(9) which function nearly the
same as at_exit(9) and rm_at_exit(9); these functions are called
on behalf of modules at the time of execve(2) after the image
activator has run.

Use a modified version of tegge's suggestion via at_exec(9) to close
an exploitable race in AIO.

Fix SYSCALL_MODULE_HELPER such that it's architecturally neutral;
the problem was that one had to pass it a parameter indicating the
number of arguments, which was actually the number of "int"s. Fix
it by using an inline version of the AS macro against the syscall
arguments. (AS should be available globally but we'll get to that
later.)

Add a primitive system for dynamically adding kqueue ops, it's really
not as sophisticated as it should be, but I'll discuss with jlemon when
he's around.


# c60693db 02-Nov-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Reserve 378 for the new mount syscall Maxime Henrion <mux@qualys.com>
is working on. (This is to get us more than 32 mountoptions).


# b55abfd9 13-Oct-2001 Robert Watson <rwatson@FreeBSD.org>

o Reserve system call 377 for afs_syscall; by reserving a system call
number, portable OpenAFS applications don't have to attempt to determine
what system call number was dynamically allocated. No system call
prototype or implementation is defined.

Requested by: Tom Maher <tardis@watson.org>


# 9c94f773 21-Sep-2001 Robert Watson <rwatson@FreeBSD.org>

o Introduce eaccess(2), a version of access(2) that uses the effective
credentials rather than the real credentials. This is useful for
implementing GUIs which need to modify icons based on access rights,
but where use of open(2) is too expensive, use of stat(2) doesn't
reflect the file system's real protection model, and use of
access() suffers from real/effective credential confusion. This
implementation provides the same semantics as the call of the same
name on SCO OpenServer. Note: using this call improperly can
leave you subject to some of the same races present in the
access(2) call.
o To implement this, break out the basic logic of access(2) into
vpaccess(), which accepts a passed credential to perform the
invocation of VOP_ACCESS(). Add eaccess(2) to invoke vpaccess(),
and modify access(2) to use vpaccess().

Obtained from: TrustedBSD Project


# eb25edbd 18-Sep-2001 Peter Wemm <peter@FreeBSD.org>

Cleanup and split of nfs client and server code.
This builds on the top of several repo-copies.


# 257d1988 01-Sep-2001 Matthew Dillon <dillon@FreeBSD.org>

Synchronize syscalls.master(s) with recent Giant pushdown work


# 918c3b13 31-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Make yield() MPSAFE.

Synchronize syscalls.master with all MPSAFE changes to date. Synchronize
new syscall generation follows because yield() will panic if it is out
of sync with syscalls.master.


# df998760 30-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Giant pushdown syscalls in kern/uipc_syscalls.c. Affected calls:

recvmsg(), sendmsg(), recvfrom(), accept(), getpeername(), getsockname(),
socket(), connect(), accept(), send(), recv(), bind(), setsockopt(), listen(),
sendto(), shutdown(), socketpair(), sendfile()


# b6a4b4f9 30-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Giant Pushdown: sysv shm, sem, and msg calls.


# 356861db 30-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Remove the MPSAFE keyword from the parser for syscalls.master.
Instead introduce the [M] prefix to existing keywords. e.g.
MSTD is the MP SAFE version of STD. This is preparatory for a
massive Giant lock pushdown. The old MPSAFE keyword made
syscalls.master too messy.

Begin comments MP-Safe procedures with the comment:
/*
* MPSAFE
*/
This comment means that the procedure may be called without
Giant held (The procedure itself may still need to obtain
Giant temporarily to do its thing).

sv_prepsyscall() is now MP SAFE and assumed to be MP SAFE
sv_transtrap() is now MP SAFE and assumed to be MP SAFE

ktrsyscall() and ktrsysret() are now MP SAFE (Giant Pushdown)
trapsignal() is now MP SAFE (Giant Pushdown)

Places which used to do the if (mtx_owned(&Giant)) mtx_unlock(&Giant)
test in syscall[2]() in */*/trap.c now do not. Instead they
explicitly unlock Giant if they previously obtained it, and then
assert that it is no longer held to catch broken system calls.

Rebuild syscall tables.
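
The dispatch rule described in this entry (take Giant only for calls that are not marked MP safe, release it only if it was taken, then assert it is not held) can be modelled in userland. This is a toy sketch, not kernel code: struct toy_giant, struct toy_sysent, toy_dispatch() and toy_record_giant() are invented names, and a plain boolean stands in for the real mutex.

```c
#include <assert.h>
#include <stdbool.h>

/* Userland stand-in for Giant; NOT the kernel mutex API. */
struct toy_giant {
    bool held;
};

typedef int (*toy_syscall_fn)(struct toy_giant *g, void *args);

/* Hypothetical table entry; `mpsafe` models the M prefix (e.g. MSTD). */
struct toy_sysent {
    toy_syscall_fn call;
    bool mpsafe;
};

static int
toy_dispatch(struct toy_giant *g, const struct toy_sysent *se, void *args)
{
    bool took_giant = false;
    int error;

    if (!se->mpsafe) {
        /* Non-MPSAFE calls run with Giant held on their behalf. */
        assert(!g->held);
        g->held = true;
        took_giant = true;
    }
    error = se->call(g, args);
    /* Unlock only if this path obtained the lock earlier. */
    if (took_giant)
        g->held = false;
    /* Catch handlers that return with Giant still held. */
    assert(!g->held);
    return (error);
}

/* Example handler: records whether Giant was held on entry. */
static int
toy_record_giant(struct toy_giant *g, void *args)
{
    *(bool *)args = g->held;
    return (0);
}
```

The final assertion is the point of the design: a handler that leaks the lock fails loudly at the dispatch site instead of being silently cleaned up.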


# b6343691 29-May-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Remove a comment which was past its shelf life.

PR: 18750
Submitted by: Tony Finch <dot@dotat.at>


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# b4b469e6 11-May-2001 Tor Egge <tegge@FreeBSD.org>

gettimeofday() is MP safe on both -current and -stable.


# 130d0157 11-Apr-2001 Robert Watson <rwatson@FreeBSD.org>

o Introduce a new system call, __setsugid(), which allows a process to
toggle the P_SUGID bit explicitly, rather than relying on it being
set implicitly by other protection and credential logic. This feature
is introduced to support inter-process authorization regression testing
by simplifying userland credential management allowing the easy
isolation and reproduction of authorization events with specific
security contexts. This feature is enabled only by "options REGRESSION"
and is not intended to be used by applications. While the feature is
not known to introduce security vulnerabilities, it does allow
processes to enter previously inaccessible parts of the credential
state machine, and is therefore disabled by default. It may not
constitute a risk, and therefore in the future pending further analysis
(and appropriate need) may become a published interface.

Obtained from: TrustedBSD Project


# fec605c8 31-Mar-2001 Robert Watson <rwatson@FreeBSD.org>

o Introduce extattr_{delete,get,set}_fd() to allow extended attribute
operations on file descriptors, which complement the existing set of
calls, extattr_{delete,get,set}_file() which act on paths. In doing
so, restructure the system call implementation such that the two sets
of functions share most of the relevant code, rather than duplicating
it. This pushes the vnode locking into the shared code, but keeps
the copying in of some arguments in the system call code. Allowing
access via file descriptors reduces the opportunity for race
conditions when managing extended attributes.

Obtained from: TrustedBSD Project


# 30632071 18-Mar-2001 Robert Watson <rwatson@FreeBSD.org>

o Rename "namespace" argument to "attrnamespace" as namespace is a C++
reserved word.

Submitted by: jkh
Obtained from: TrustedBSD Project


# 70f36851 14-Mar-2001 Robert Watson <rwatson@FreeBSD.org>

o Change the API and ABI of the Extended Attribute kernel interfaces to
introduce a new argument, "namespace", rather than relying on a first-
character namespace indicator. This is in line with more recent
thinking on EA interfaces on various mailing lists, including the
posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces
are defined by default, EXTATTR_NAMESPACE_SYSTEM and
EXTATTR_NAMESPACE_USER, where the primary distinction lies in the
access control model: user EAs are accessible based on the normal
MAC and DAC file/directory protections, and system attributes are
limited to kernel-originated or appropriately privileged userland
requests.

o These API changes occur at several levels: the namespace argument is
introduced in the extattr_{get,set}_file() system call interfaces,
at the vnode operation level in the vop_{get,set}extattr() interfaces,
and in the UFS extended attribute implementation. Changes are also
introduced in the VFS extattrctl() interface (system call, VFS,
and UFS implementation), where the arguments are modified to include
a namespace field, as well as modified to avoid direct access to
userspace variables from below the VFS layer (in the style of recent
changes to mount by adrian@FreeBSD.org). This required some cleanup
and bug fixing regarding VFS locks and the VFS interface, as a vnode
pointer may now be optionally submitted to the VFS_EXTATTRCTL()
call. Updated documentation for the VFS interface will be committed
shortly.

o In the near future, the auto-starting feature will be updated to
search two sub-directories to the ".attribute" directory in appropriate
file systems: "user" and "system" to locate attributes intended for
those namespaces, as the single filename is no longer sufficient
to indicate what namespace the attribute is intended for. Until this
is committed, all attributes auto-started by UFS will be placed in
the EXTATTR_NAMESPACE_SYSTEM namespace.

o The default POSIX.1e attribute names for ACLs and Capabilities have
been updated to no longer include the '$' in their filename. As such,
if you're using these features, you'll need to rename the attribute
backing files to the same names without '$' symbols in front.

o Note that these changes will require changes in userland, which will
be committed shortly. These include modifications to the extended
attribute utilities, as well as to libutil for new namespace
string conversion routines. Once the matching userland changes are
committed, a buildworld is recommended to update all the necessary
include files and verify that the kernel and userland environments
are in sync. Note: If you do not use extended attributes (most people
won't), upgrading is not imperative although since the system call
API has changed, the new userland extended attribute code will no longer
compile with old include files.

o Couple of minor cleanups while I'm there: make more code compilation
conditional on FFS_EXTATTR, which should recover a bit of space on
kernels running without EA's, as well as update copyright dates.

Obtained from: TrustedBSD Project
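
The move from a first-character namespace indicator to an explicit namespace argument can be illustrated with a small conversion helper. This is a sketch under assumptions: it takes '$' to be the old first-character marker for system attributes (as the renaming note above suggests), and the enum values and the name toy_split_legacy_name() are invented for the example; the real constants are EXTATTR_NAMESPACE_USER and EXTATTR_NAMESPACE_SYSTEM.

```c
#include <stddef.h>

/* Illustrative stand-ins for the kernel's namespace constants. */
enum toy_extattr_ns {
    TOY_NAMESPACE_USER,
    TOY_NAMESPACE_SYSTEM
};

/*
 * Hypothetical helper: split an old-style attribute name into an
 * explicit namespace plus a bare name.  Assumes a leading '$' marked
 * a system attribute under the old scheme.
 */
static const char *
toy_split_legacy_name(const char *legacy, enum toy_extattr_ns *ns)
{
    if (legacy[0] == '$') {
        *ns = TOY_NAMESPACE_SYSTEM;
        return (legacy + 1);    /* name without the marker */
    }
    *ns = TOY_NAMESPACE_USER;
    return (legacy);
}
```

With the namespace passed separately, the access-control decision no longer depends on parsing the attribute name.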


# 86360fee 01-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

Remove thr_sleep and thr_wakeup. Remove fields p_nthread and p_wakeup
from struct proc, which are now unused (p_nthread already was).
Remove process flag P_KTHREADP which was untested and only set
in vfs_aio.c (it should use kthread_create). Move the yield
system call to kern_synch.c as kern_threads.c has been removed
completely.

moral support from: alfred, jhb


# 78525ce3 01-Dec-2000 Alfred Perlstein <alfred@FreeBSD.org>

sysvipc loadable.

new syscall entry lkmressys - "reserved loadable syscall"

Make syscall_register allow overwriting of such entries (lkmressys).


# ae51d56c 28-Aug-2000 Marcel Moolenaar <marcel@FreeBSD.org>

Fix prototypes for {o|}{g|s}etrlimit. A recent change in the
Linuxulator caused this bug to trigger.


# 4e0f152b 29-Jul-2000 Peter Wemm <peter@FreeBSD.org>

Sigh. Fix SYS_exit problems. I misunderstood the significance of these
trailing options.


# ac2b067b 28-Jul-2000 Peter Wemm <peter@FreeBSD.org>

Change the 'exit()' system call to 'sys_exit()'. This avoids overlapping
gcc's internal exit() prototypes and the (futile) hackery that we did to
try and avoid warnings. main() was renamed for similar reasons.
Remove an exit related hack from makesyscalls.sh.


# a8e65b91 18-Jul-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Simplify kqueue API slightly.

Discussed on: -arch


# 92eebb8a 13-Jul-2000 Robert Watson <rwatson@FreeBSD.org>

o Introduce syscall prototypes, stubs for __cap_{get,set}_{fd,file},
syscalls to manage capability sets on files. First of two commits.

Obtained from: TrustedBSD Project


# b09b66ab 15-Jun-2000 Robert Watson <rwatson@FreeBSD.org>

Introduce syscalls for process capability manipulation. Currently backs
onto already committed stubs. Commit one of two.

Reviewed by: Damned if I can remember. Many people.
Obtained from: TrustedBSD Project


# aa4b7eae 09-May-2000 Bruce Evans <bde@FreeBSD.org>

Fixed the declaration of mmap(). The crufty padding arg had the wrong
type. This gave an inconsistent amount of crufty padding on i386's with
64-bit longs (8 bytes instead of 4). On alphas it gives a consistent
amount of crufty padding (8 bytes) in addition to the 4 bytes of normal
padding caused by passing int args as register_t's.

Fixed the args struct tag for the NOPROTO syscalls (netbsd_lchown() and
netbsd_msync()). The tag is currently unused for NOPROTO syscalls, so
the bug has no effect, but it will be used even in the NOPROTO case to
calculate sy_nargs correctly.


# 39e4c0c8 01-May-2000 Peter Wemm <peter@FreeBSD.org>

Remove undocumented broken-as-designed semconfig() syscall.


# cb679c38 16-Apr-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce kqueue() and kevent(), a kernel event notification facility.


# c01df631 03-Apr-2000 Alfred Perlstein <alfred@FreeBSD.org>

Make makesyscalls.sh parse an optional field 'MPSAFE' that specifies
that a syscall does not want the BGL to be grabbed automatically.

Add the new MPSAFE flag to the syscalls that dillon has determined to
be MPSAFE.


# 5134b3e9 18-Jan-2000 Robert Watson <rwatson@FreeBSD.org>

Fix bde'isms in acl/extattr syscall interface, renaming syscalls to
prettier (?) names, adding some const's around here, et al.

Commit 1 out of 3.

Reviewed by: bde


# 8ccd6334 16-Jan-2000 Peter Wemm <peter@FreeBSD.org>

Implement setres[ug]id() and getres[ug]id(). This has been sitting in
my tree for ages (~2 years) waiting for an excuse to commit it. Now Linux
has implemented it and it seems that Staroffice (when using the
linux_base6.1 port's libc) calls this in the linux emulator and dies in
setup. The Linux emulator can call these now.
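
A minimal sketch of the query side of these calls, assuming a system that provides getresuid(2) and getresgid(2) in <unistd.h>. The helper name resids_agree() is invented for the example, and the _GNU_SOURCE define is only there because glibc hides the prototypes without it.

```c
#define _GNU_SOURCE     /* needed on glibc; harmless elsewhere */
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: fetch the real, effective and saved IDs in one
 * call each and cross-check them against the older single-ID getters.
 * Returns 1 if they agree, 0 if not, -1 if a call failed.
 */
static int
resids_agree(void)
{
    uid_t ruid, euid, suid;
    gid_t rgid, egid, sgid;

    if (getresuid(&ruid, &euid, &suid) != 0 ||
        getresgid(&rgid, &egid, &sgid) != 0)
        return (-1);
    return (ruid == getuid() && euid == geteuid() &&
        rgid == getgid() && egid == getegid());
}
```

The saved IDs (suid, sgid) are the part the older getters cannot report, which is what makes the three-value interface useful to emulated binaries.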


# bfbbc4aa 13-Jan-2000 Jason Evans <jasone@FreeBSD.org>

Add aio_waitcomplete(). Make aio work correctly for socket descriptors.
Make gratuitous style(9) fixes (me, not the submitter) to make the aio
code more readable.

PR: kern/12053
Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>


# 20883b0f 21-Dec-1999 Alfred Perlstein <alfred@FreeBSD.org>

make getfh a standard syscall instead of dependent on having
NFSSERVER defined, useful for userland fileservers that want to
use a filehandle type interface to the filesystem.

Submitted by: Assar Westerlund assar@stacken.kth.se
PR: kern/15452


# ef351daa 18-Dec-1999 Robert Watson <rwatson@FreeBSD.org>

First pass commit to introduce new ACL and Extended Attribute system calls.
The second pass commit with all the supporting code will happen shortly
afterwards.

Reviewed by: eivind


# b08210f5 17-Nov-1999 Brian Somers <brian@FreeBSD.org>

modfind(char *) -> modfind(const char *)

Reminded by: dfr


# b7d85123 12-Oct-1999 Marcel Moolenaar <marcel@FreeBSD.org>

Now that userland including modules don't use the osig* syscalls,
make them of type COMPAT.


# da3605db 29-Sep-1999 Marcel Moolenaar <marcel@FreeBSD.org>

sigset_t change (part 1 of 5)
-----------------------------

Rename sigaction, sigprocmask, sigpending and sigsuspend to
osigaction, osigprocmask, osigpending and osigsuspend (resp)
and add new syscalls for them to support the new sigset_t
without breaking existing binaries.

Change the prototype of sigaltstack to use the typedef stack_t
instead of struct sigaltstack to reflect that it is SUSv2
compliant.

Also, rename sigreturn to osigreturn and add a new syscall
to support the modified stackframe. The change is caused by
sigreturn operating on ucontext_t now and the fact that
siginfo_t has been updated to conform to SUSv2.
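
From userland, the sigset_t-based interfaces named in this entry are used as in the sketch below. It assumes a single-threaded process; the helper name pending_after_blocked_raise() is invented for the example, and sigwait(3) is used only to discard the signal before the old mask is restored.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * Hypothetical helper: block `signo`, raise it, and report whether it
 * is pending.  Returns 1 if pending, 0 if not, -1 on error.
 */
static int
pending_after_blocked_raise(int signo)
{
    sigset_t set, oset, pend;
    int got, was_pending;

    sigemptyset(&set);
    sigaddset(&set, signo);
    if (sigprocmask(SIG_BLOCK, &set, &oset) != 0)
        return (-1);
    if (raise(signo) != 0) {
        (void)sigprocmask(SIG_SETMASK, &oset, NULL);
        return (-1);
    }
    was_pending = (sigpending(&pend) == 0) ?
        sigismember(&pend, signo) : -1;
    /* Consume the signal so that unblocking does not deliver it. */
    (void)sigwait(&set, &got);
    (void)sigprocmask(SIG_SETMASK, &oset, NULL);
    return (was_pending);
}
```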


# c24fda81 10-Sep-1999 Alfred Perlstein <alfred@FreeBSD.org>

Separate the export check in VFS_FHTOVP, exports are now checked via
VFS_CHECKEXP.

Add fh(open|stat|statfs) syscalls to allow userland to query filesystems
based on (network) filehandle.

Obtained from: NetBSD


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 23955079 11-Aug-1999 Nik Clayton <nik@FreeBSD.org>

Add CPT_NOA, LIBCOMPAT, NODEF, NOARGS, NOPROTO, and NOIMPL to the commented
list of available types.

PR: docs/13007
Submitted by: Assar Westerlund <assar@sics.se>


# 45f26d41 05-Aug-1999 Jordan K. Hubbard <jkh@FreeBSD.org>

Move syscall 180 back to where it was before and fix the
incorrect comment which led me to move it in the first place.


# b24eb279 04-Aug-1999 Jordan K. Hubbard <jkh@FreeBSD.org>

Reserve a syscall for the arla folks. I'm assuming that since syscalls.c
and init_sysent.c are checked into CVS, I should also commit the regenerated
copies even though they're built by syscalls.master. Correct? Bruce? :)


# f664346f 13-May-1999 Bruce Evans <bde@FreeBSD.org>

Fixed nonsense arg type `const caddr_t' in the prototype() for utrace().
Changed to `const void *'. utrace() is undocumented, so nothing should
notice.

Fixed missing consts for utrace() and ktrace() in syscalls.master.

sys/ktrace.h is missing some Lite2 changes of shorts to ints.


# 02daf150 28-Apr-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Add the jail system call.


# 8fe387ab 04-Apr-1999 Dmitrij Tejblum <dt@FreeBSD.org>

Add standard padding argument to pread and pwrite syscalls. That should make them
NetBSD compatible.

Add parameter to fo_read and fo_write. (The only flag FOF_OFFSET means that
the offset is set in the struct uio).

Factor out some common code from read/pread/write/pwrite syscalls.


# 4160ccd9 27-Mar-1999 Alan Cox <alc@FreeBSD.org>

Added pread and pwrite. These functions are defined by the X/Open
Threads Extension. (Note: We use the same syscall numbers as NetBSD.)

Submitted by: John Plevyak <jplevyak@inktomi.com>
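
What distinguishes pread(2) from read(2) is that the offset is passed explicitly and the descriptor's file position is left alone. The sketch below shows this; pread_full() and pread_demo() are names invented for the example, and the demo writes a scratch file under /tmp.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to `len` bytes starting at `off`,
 * retrying on short reads and EINTR.  Returns bytes read or -1.
 */
static ssize_t
pread_full(int fd, void *buf, size_t len, off_t off)
{
    char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = pread(fd, p + done, len - done, off + (off_t)done);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return (-1);
        }
        if (n == 0)     /* end of file */
            break;
        done += (size_t)n;
    }
    return ((ssize_t)done);
}

/* Usage example: returns 0 if the file position is untouched. */
static int
pread_demo(void)
{
    char path[] = "/tmp/pread_demo.XXXXXX";
    char buf[5];
    off_t before;
    int fd, ok;

    if ((fd = mkstemp(path)) < 0)
        return (-1);
    (void)unlink(path);
    ok = write(fd, "hello world", 11) == 11;
    before = lseek(fd, 0, SEEK_CUR);
    ok = ok && pread_full(fd, buf, sizeof(buf), 6) == 5 &&
        memcmp(buf, "world", 5) == 0 &&
        lseek(fd, 0, SEEK_CUR) == before;
    (void)close(fd);
    return (ok ? 0 : -1);
}
```

Because no shared file position is modified, several threads can read from the same descriptor without an lseek/read race, which is why the calls belong to a threads extension.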


# 325e13dd 10-Nov-1998 Peter Wemm <peter@FreeBSD.org>

A kldsym(2) syscall prototype for extracting information from the in-kernel
linker. This is intended to replace kvm_mkdb etc. The first version
only does name->value lookups, but it's open ended. value->name lookups
would probably be a good thing to do too.

It's been suggested to try and connect the symbol tables to sysctl (which
is probably a more flexible way of doing it if it's done right), but that
is far more complex and difficult than I was ready to have a shot at.


# dd0b2081 05-Nov-1998 David Greenman <dg@FreeBSD.org>

Implemented zero-copy TCP/IP extensions via sendfile(2) - send a
file to a stream socket. sendfile(2) is similar to implementations in
HP-UX, Linux, and other systems, but the API is more extensive and
addresses many of the complaints that the Apache Group and others have
had with those other implementations. Thanks to Marc Slemko of the
Apache Group for helping me work out the best API for this.
Anyway, this has the "net" result of speeding up sends of files over
TCP/IP sockets by about 10X (that is to say, uses 1/10th of the CPU
cycles) when compared to a traditional read/write loop.
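
For reference, the "traditional read/write loop" used as the baseline in this entry looks roughly like the sketch below. It is not sendfile(2) itself; copy_fd() is a name invented for the example. Every chunk is copied from the kernel into a user buffer and back again, which is the overhead a zero-copy path avoids.

```c
#include <sys/types.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy everything from in_fd to out_fd through a
 * user-space buffer.  Returns the number of bytes copied, or -1.
 */
static ssize_t
copy_fd(int in_fd, int out_fd)
{
    char buf[8192];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof(buf));
        ssize_t off = 0;

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return (-1);
        }
        if (n == 0)             /* end of input */
            return (total);
        while (off < n) {       /* handle short writes */
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return (-1);
            }
            off += w;
        }
        total += n;
    }
}
```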


# 2e83b281 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Fix a few syscall arguments to use size_t instead of u_int.


# ecbb00a2 07-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# 786cf38a 14-May-1998 Peter Wemm <peter@FreeBSD.org>

deep-six signanosleep(). It sounded like a good idea at the time.


# 1f562172 10-May-1998 John Dyson <dyson@FreeBSD.org>

Fix the futimes/undelete/utrace conflict with other BSD's. Note that
the only common usage of utrace (the possible problem with this
commit) is with malloc, so this should not be a real problem. Add
the various NetBSD syscalls that allow full emulation of their
development environment.


# 8a6472b7 28-Mar-1998 Peter Dufault <dufault@FreeBSD.org>

Finish _POSIX_PRIORITY_SCHEDULING. Needs P1003_1B and
_KPOSIX_PRIORITY_SCHEDULING options to work. Changes:

Change all "posix4" to "p1003_1b". Misnamed files are left
as "posix4" until I'm told if I can simply delete them and add
new ones;

Add _POSIX_PRIORITY_SCHEDULING system calls for FreeBSD and Linux;

Add man pages for _POSIX_PRIORITY_SCHEDULING system calls;

Add options to LINT;

Minor fixes to P1003_1B code during testing.


# 14f1d426 03-Feb-1998 Bruce Evans <bde@FreeBSD.org>

Fixed type of mincore().


# c5b193bf 30-Jan-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Retire LFS.

If you want to play with it, you can find the final version of the
code in the repository under the tag LFS_RETIREMENT.

If somebody makes LFS work again, adding it back is certainly
desirable, but as it is now nobody seems to care much about it,
and it has suffered considerable bitrot since its somewhat haphazard
integration.

R.I.P


# 7b778b5e 23-Jan-1998 Eivind Eklund <eivind@FreeBSD.org>

Make all file-system (MFS, FFS, NFS, LFS, DEVFS) related option new-style.

This introduces an xxxFS_BOOT for each of the rootable filesystems.
(Presently not required, but encouraged to allow a smooth move of option *FS
to opt_dontuse.h later.)

LFS is temporarily disabled, and will be re-enabled tomorrow.


# de17eb59 01-Jan-1998 Alexander Langer <alex@FreeBSD.org>

Added missing caddr_t --> void * conversions for sys/mman.h functions.

Submitted by: bde


# e6e21bc0 26-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Add "NOIMPL" for syscalls whose purpose we know but which we don't implement as "STD".
Use this for getfh & nfssvc.


# 7822f1c6 14-Sep-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Add a __getcwd() syscall. This is intentionally undocumented, but all
it does is to try to figure the pwd out from the vfs namecache, and
return a reversed string to it. libc:getcwd() is responsible for
flipping it back.


# 8cb0553a 13-Sep-1997 Peter Wemm <peter@FreeBSD.org>

Activate poll(2) syscall


# 6871cc62 18-Aug-1997 Peter Wemm <peter@FreeBSD.org>

SVR4/XPG-style getpgid()/getsid() syscalls.


# 2c1011f7 15-Jun-1997 John Dyson <dyson@FreeBSD.org>

Modifications to existing files to support the initial AIO/LIO and
kernel based threading support.


# 99f06d5c 01-Jun-1997 Peter Wemm <peter@FreeBSD.org>

New syscall, signanosleep(), which is a hybrid of sigsuspend(2) and
nanosleep(2). It sleeps until either the time expires, or a signal
permitted by the supplied mask arrives (eg: SIGALRM if appropriate)


# 851679e5 08-May-1997 Peter Wemm <peter@FreeBSD.org>

oops. NODIDE -> NOHIDE


# b6f031b7 08-May-1997 Peter Wemm <peter@FreeBSD.org>

Define entries for the posix-style clock/timer syscalls including
nanosleep(). Also, note some syscall conflicts with other systems and
indicate slots tagged for use with other syscalls some day.


# cea6c86c 07-May-1997 Doug Rabson <dfr@FreeBSD.org>

This is the kernel linker. To use it, you will first need to apply
the patches in freefall:/home/dfr/ld.diffs to your ld sources and set
BINFORMAT to aoutkld when linking the kernel.

Library changes and userland utilities will appear in a later commit.


# 56f12a6c 31-Mar-1997 Peter Wemm <peter@FreeBSD.org>

issetugid is now implemented rather than reserved


# 4eb542c6 30-Mar-1997 Peter Wemm <peter@FreeBSD.org>

Reserve 252 (poll, first in OpenBSD)
Reserve 253 (issetugid, as in OpenBSD)
Allocate 254 for lchown(2)


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# ac0ad63f 16-Jan-1997 Bruce Evans <bde@FreeBSD.org>

Reduced #include spam in <sys/sysproto.h> and fixed things that depended
on it.

makesyscalls.sh:
This parsed $Id$. Fixed(?) to parse $FreeBSD$. The output is wrong when
the id is not expanded in the source file.

syscalls.master:
Fixed declaration of sigsuspend(). There are still some bogons and
spam involving sigset_t.
Use `struct foo *' instead of the equivalent `foo_t *' for some nfs and
lfs syscalls so that <sys/sysproto.h> doesn't depend on <sys/mount.h>.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# e6c4b9ba 19-Sep-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Add the utrace(caddr_t addr,size_t len) syscall, that will store the
data pointed at in a ktrace file, if this process is being ktrace'ed.
I'm using this to profile malloc usage.
The advantage is that there is no context around this call, ie, no
open file or socket, so it will work in any process, and you can
decide if you want it to collect data or not.


# b08f7993 20-Aug-1996 Sujal Patel <smpatel@FreeBSD.org>

Remove the kernel FD_SETSIZE limit for select().
Make select()'s first argument 'int' not 'u_int'.

Reviewed by: bde


# edbfedac 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.
[note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]


# 3f7efdf3 02-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Change the 'int len' args in the mmap/msync/mincore/etc class syscalls
to 'size_t' as per bde's request.


# 96ac07ef 23-Feb-1996 Peter Wemm <peter@FreeBSD.org>

Add hooks for rfork/minherit pair, and reset args of vfork in preparation
for adding the syscalls.


# 4f9a71f6 23-Feb-1996 Peter Wemm <peter@FreeBSD.org>

Note the syscall numbers used in BSD/OS 2.x. We don't want to
accidentally use one of these ourselves as it'd make it harder to run
their binaries.
Also, remove the now-defunct #include "opt_sysvipc.h".


# 99cb2993 13-Jan-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Add an option NFS_NOSERVER which saves 100K in the install kernel (or
any other kernel that uses it). Use with option NFS.


# e7ae3bf0 07-Jan-1996 Peter Wemm <peter@FreeBSD.org>

Remove the #ifdef SYSVSHM etc. Always call the functions, some stubs
are about to go in. This is to fix the problem with the ibcs2 and linux
lkm's not being able to call the sysv ipc functions unless the build is
modified.


# 50c73f36 04-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Convert SYSV IPC to new-style options. (I hope I got everything...)
The LKMs will need an extra file, to come later.


# db6a20e2 03-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Converted two options over to the new scheme: USER_LDT and KTRACE.


# bf4f3984 14-Dec-1995 Peter Wemm <peter@FreeBSD.org>

Add the direct sysv shm/sem/msg system calls, in the same way as NetBSD.
This costs very little, we gain prototypes for the calls from the linux
emulator, and this is one less thing in the way of NetBSD binary support.


# 93915a2a 11-Nov-1995 Bruce Evans <bde@FreeBSD.org>

Fixed the args list for mount(). We're not ready for the BSD4.4lite2/
NetBSD interface.

Increased the bogusness of the args list for mmap(). The args lists for
most of the memory mapping functions are bogus. The args lists in
syscalls.master are a little better than the ones in the args structs
currently being used, but the improvement for mmap() changed the object
code and I don't want to worry about that now.

Increased the bogusness of the args list for fcntl. BSD4.4lite2/NetBSD
uses `void *' instead of int for the third arg. This has the advantage
of working when `void *'s are longer than ints, but requires extra bogus
casts that I hope to avoid.

Fixed the args list for uname. `struct outsname' seems to be a typo,
not an old interface.

Added comments about bogus args lists for open, mount, msync, munmap,
mprotect, madvise, mincore, fcntl, semsys, msgsys and shmsys.


# a932a33d 07-Oct-1995 Steven Wallace <swallace@FreeBSD.org>

Fix misc formatting errors in makesyscalls.sh.

Add CPT_NOA type which is COMPAT with NOARGS -- do not produce argument
struct in sysproto.

Change accept, recvfrom, getsockname to CPT_NOA type.
Fix getrlimit, setrlimit argument #2 name to struct rlimit.


# f171307e 07-Oct-1995 Steven Wallace <swallace@FreeBSD.org>

Add new functionality to makesyscalls.sh:
o optional config-file to set vars: sysnames, sysproto, sysproto_h,
syshdr, syssw, syshide, syscallprefix, switchname, namesname, sysvec.
o change syntax of syscalls.master entry:
remove argument count.
add pseudo-prototype field defining function name and arguments.
o generates correct structure definitions for all system calls
in sys/sysproto.h
o add type NOARGS: same as STD except do not create structure in
sys/sysproto.h
o add type NOPROTO: same as STD except do not create structure or function
prototype in sys/sysproto.h

New functionality provides complete prototype definitions.
Useful for generating files for emulated systems like my new ibcs2 code.

Update syscalls.master to reflect new changes. For example, read()
entry now looks like:

3 STD POSIX { int ibcs2_read(int fd, char *buf, u_int nbytes); }

This is similar to how NetBSD generates these files.
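
As a rough picture of what the pseudo-prototype in the read() entry above describes, the sketch below shows an argument structure with one member per declared argument and a handler that receives it. Everything here is a simplified assumption: the toy_ names are invented, and the real definitions generated into sys/sysproto.h and the real handler signature differ in detail.

```c
#include <stddef.h>

/*
 * Toy counterpart of the structure implied by
 *   { int ibcs2_read(int fd, char *buf, u_int nbytes); }
 * One member per argument, in declaration order.
 */
struct toy_ibcs2_read_args {
    int fd;
    char *buf;
    unsigned int nbytes;
};

/*
 * Hypothetical handler shape: all arguments arrive in one structure
 * and the result is stored through `retval`.  The body is a stand-in
 * that pretends every requested byte was read.
 */
static int
toy_ibcs2_read(void *v, int *retval)
{
    struct toy_ibcs2_read_args *uap = v;

    if (uap->buf == NULL && uap->nbytes != 0)
        return (-1);            /* stand-in for an error return */
    *retval = (int)uap->nbytes;
    return (0);
}
```

Deriving the structure from the prototype is what lets the argument count be dropped from each syscalls.master entry.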


# 3cb43dbd 19-Sep-1995 Bruce Evans <bde@FreeBSD.org>

Generate prototypes for syscall-implementing functions. Put them in
<sys/sysproto.h> and use them (so far only) in kern/init_sysent.c.

Don't put $Id in generated files.

kern/syscalls.master:
I had to add some new fields to describe some non-orthogonal names.
E.g., the args struct for the syscall-implementing function foo()
is usually named `foo_args', but for getpid() it is named `args'.

sys/sysent.h:
sy_call_t is still incomplete to hide a couple of warnings.


# e876c909 22-Apr-1995 Andrey A. Chernov <ache@FreeBSD.org>

Make setreuid/setregid active syscalls


# bc4c84cf 25-Mar-1995 David Greenman <dg@FreeBSD.org>

Added a third "flags" argument to msync() ...as other systems have.


# 403ef252 03-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed obsolete vtrace() remnants.


# 23f6ed01 14-Dec-1994 Garrett Wollman <wollman@FreeBSD.org>

Actually enable NTP kernel PLL. (Oops!)
Noticed by Pete Carah.


# 7216391e 01-Oct-1994 David Greenman <dg@FreeBSD.org>

"idle priority" support. Based on code from Henrik Vestergaard Draboel,
but substantially rewritten by me.


# 5ea9b263 28-Sep-1994 Garrett Wollman <wollman@FreeBSD.org>

LKM support is no longer optional.


# 3f31c649 18-Sep-1994 Garrett Wollman <wollman@FreeBSD.org>

Redo Kernel NTP PLL support, kernel side.

This code is mostly taken from the 1.1 port (which was in turn taken from
Dave Mills's kern.tar.Z example). A few significant differences:

1) ntp_gettime() is now a MIB variable rather than a system call. A few
fiddles are done in libc to make it behave the same.

2) mono_time does not participate in the PLL adjustments.

3) A new interface has been defined (in <machine/clock.h>) for doing
possibly machine-dependent things around the time of the clock update.
This is used in Pentium kernels to disable interrupts, set `time', and
reset the CPU cycle counter as quickly as possible to avoid jitter in
microtime(). Measurements show an apparent resolution of a bit more than
8.14usec, which is reasonable given system-call overhead.


# 3d903220 13-Sep-1994 Doug Rabson <dfr@FreeBSD.org>

Added SYSV ipcs.

Obtained from: NetBSD and FreeBSD-1.1.5


# 0960a7f0 12-Sep-1994 Garrett Wollman <wollman@FreeBSD.org>

Added namespace information for future pollution-control measures.


# e8fb0b2c 31-Aug-1994 David Greenman <dg@FreeBSD.org>

Realtime priority scheduling support.

Submitted by: Henrik Vestergaard Draboel


# 24ea21ce 26-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Added ntp_gettime and ntp_adjtime syscalls, both nosys'ed out until
someone gets to re-integrating the code. ntp_gettime() should be
turned into a sysctl variable and emulated in the library.


# 3edb235c 19-Aug-1994 David Greenman <dg@FreeBSD.org>

Terry Lambert's loadable kernel module support w/improvements from the
NetBSD group.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources