History log of /freebsd-10.1-release/sys/kern/syscalls.master
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 272461 02-Oct-2014 gjb

Copy stable/10@r272459 to releng/10.1 as part of
the 10.1-RELEASE process.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 255708 19-Sep-2013 jhb

Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
from arbitrary processes. Similar to ktrace it can apply a change to all
existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes. MADV_PROTECT still works for backwards
compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.

Reviewed by: kib, jilles (earlier version)
Approved by: re (delphij)
MFC after: 1 month


# 255490 12-Sep-2013 jhb

Fix the type of the idtype argument to wait6() in syscalls.master.

Approved by: re (kib)
MFC after: 1 week


# 255219 04-Sep-2013 pjd

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


# 251526 08-Jun-2013 glebius

Add new system call - aio_mlock(). The name speaks for itself. It allows
to perform the mlock(2) operation, which can consume a lot of time, under
control of aio(4).

Reviewed by: kib, jilles
Sponsored by: Nginx, Inc.


# 250853 21-May-2013 kib

Fix the wait6(2) on 32bit architectures and for the compat32, by using
the right type for the argument in syscalls.master. Also fix the
posix_fallocate(2) and posix_fadvise(2) compat32 syscalls on the
architectures which require padding of the 64bit argument.

Noted and reviewed by: jhb
Pointy hat to: kib
MFC after: 1 week


# 250159 01-May-2013 jilles

Add pipe2() system call.

The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and
O_NONBLOCK (on both sides) as part of the function.

If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p).

If the pointer is not valid, behaviour differs: pipe2() writes into the
array from the kernel like socketpair() does, while pipe() writes into the
array from an architecture-specific assembler wrapper.

Reviewed by: kan, kib


# 250154 01-May-2013 jilles

Add accept4() system call.

The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)

The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.

Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.


# 248995 02-Apr-2013 mdf

Fix return type of extattr_set_* and fix rmextattr(8) utility.

extattr_set_{fd,file,link} is logically a write(2)-like operation and
should return ssize_t, just like extattr_get_*. Also, the user-space
utility was using an int for the return value of extattr_get_* and
extattr_list_*, both of which return an ssize_t.

MFC after: 1 week


# 248599 21-Mar-2013 pjd

Implement chflagsat(2) system call, similar to fchmodat(2), but operates on
file flags.

Reviewed by: kib, jilles
Sponsored by: The FreeBSD Foundation


# 248597 21-Mar-2013 pjd

- Make 'flags' argument to chflags(2), fchflags(2) and lchflags(2) of type
u_long. Before this change it was of type int for syscalls, but prototypes
in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
for lchflags(2)) stated that it was u_long. Now some related functions
use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.

Discussed on: arch
Sponsored by: The FreeBSD Foundation


# 247667 02-Mar-2013 pjd

- Implement two new system calls:

int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);

which allow to bind and connect respectively to a UNIX domain socket with a
path relative to the directory associated with the given file descriptor 'fd'.

- Add manual pages for the new syscalls.

- Make the new syscalls available for processes in capability mode sandbox.

- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
the directory descriptor for the syscalls to work.

- Update audit(4) to support those two new syscalls and to handle path
in sockaddr_un structure relative to the given directory descriptor.

- Update procstat(1) to recognize the new capability rights.

- Document the new capability rights in cap_rights_limit(2).

Sponsored by: The FreeBSD Foundation
Discussed with: rwatson, jilles, kib, des


# 247602 01-Mar-2013 pjd

Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convinient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


# 242958 13-Nov-2012 kib

Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR: standards/170346
Submitted by: "Jukka A. Ukkonen" <jau@iki.fi>
MFC after: 1 month


# 239347 17-Aug-2012 davidxu

Implement syscall clock_getcpuclockid2, so we can get a clock id
for process, thread or others we want to support.
Use the syscall to implement POSIX API clock_getcpuclock and
pthread_getcpuclockid.

PR: 168417


# 236026 25-May-2012 ed

Remove use of non-ISO-C integer types from system call tables.

These files already use ISO-C-style integer types, so make them less
inconsistent by preferring the standard types.


# 227776 20-Nov-2011 lstewart

- Add the ffclock_getcounter(), ffclock_getestimate() and ffclock_setestimate()
system calls to provide feed-forward clock management capabilities to
userspace processes. ffclock_getcounter() returns the current value of the
kernel's feed-forward clock counter. ffclock_getestimate() returns the current
feed-forward clock parameter estimates and ffclock_setestimate() updates the
feed-forward clock parameter estimates.

- Document the syscalls in the ffclock.2 man page.

- Regenerate the script-derived syscall related files.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

Submitted by: Julien Ridoux (jridoux at unimelb edu au)


# 227691 19-Nov-2011 ed

Improve *access*() parameter name consistency.

The current code mixes the use of `flags' and `mode'. This is a bit
confusing, since the faccessat() function as a `flag' parameter to store
the AT_ flag.

Make this less confusing by using the same name as used in the POSIX
specification -- `amode'.


# 227070 04-Nov-2011 jhb

Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month


# 224987 18-Aug-2011 jonathan

Add experimental support for process descriptors

A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.

New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.

When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.

This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.

Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc


# 224066 15-Jul-2011 jonathan

Add cap_new() and cap_getrights() system calls.

Implement two previously-reserved Capsicum system calls:
- cap_new() creates a capability to wrap an existing file descriptor
- cap_getrights() queries the rights mask of a capability.

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc


# 220791 18-Apr-2011 mdf

Add the posix_fallocate(2) syscall. The default implementation in
vop_stdallocate() is filesystem agnostic and will run as slow as a
read/write loop in userspace; however, it serves to correctly
implement the functionality for filesystems that do not implement a
VOP_ALLOCATE.

Note that __FreeBSD_version was already bumped today to 900036 for any
ports which would like to use this function.

Also reserve space in the syscall table for posix_fadvise(2).

Reviewed by: -arch (previous version)


# 220163 30-Mar-2011 trasz

Add rctl. It's used by racct to take user-configurable actions based
on the set of rules it maintains and the current resource usage. It also
privides userland API to manage that ruleset.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 219304 05-Mar-2011 trasz

Add two new system calls, setloginclass(2) and getloginclass(2). This makes
it possible for the kernel to track login class the process is assigned to,
which is required for RCTL. This change also make setusercontext(3) call
setloginclass(2) and makes it possible to retrieve current login class using
id(1).

Reviewed by: kib (as part of a larger patch)


# 219129 01-Mar-2011 rwatson

Add initial support for Capsicum's Capability Mode to the FreeBSD kernel,
compiled conditionally on options CAPABILITIES:

Add a new credential flag, CRED_FLAG_CAPMODE, which indicates that a
subject (typically a process) is in capability mode.

Add two new system calls, cap_enter(2) and cap_getmode(2), which allow
setting and querying (but never clearing) the flag.

Export the capability mode flag via process information sysctls.

Sponsored by: Google, Inc.
Reviewed by: anderson
Discussed with: benl, kris, pjd
Obtained from: Capsicum Project
MFC after: 3 months


# 211998 30-Aug-2010 kib

Make the syscalls reserved for AFS usable by OpenAFS port.

Submitted by: Benjamin Kaduk <kaduk mit edu>
MFC after: 2 weeks


# 211838 26-Aug-2010 kib

Fix typo.

Submitted by: Ben Kaduk <minimarmot gmail com>


# 209579 28-Jun-2010 kib

Count number of threads that enter and leave dynamically registered
syscalls. On the dynamic syscall deregistration, wait until all
threads leave the syscall code. This somewhat increases the safety
of the loadable modules unloading.

Reviewed by: jhb
Tested by: pho
MFC after: 1 month


# 203660 08-Feb-2010 ed

Remove unused LIBCOMPAT keyword from syscalls.master.


# 198508 27-Oct-2009 kib

Current pselect(3) is implemented in usermode and thus vulnerable to
well-known race condition, which elimination was the reason for the
function appearance in first place. If sigmask supplied as argument to
pselect() enables a signal, the signal might be delivered before thread
called select(2), causing lost wakeup. Reimplement pselect() in kernel,
making change of sigmask and sleep atomic.

Since signal shall be delivered to the usermode, but sigmask restored,
set TDP_OLDMASK and save old mask in td_oldsigmask. The TDP_OLDMASK
should be cleared by ast() in case signal was not gelivered during
syscall execution.

Reviewed by: davidxu
Tested by: pho
MFC after: 1 month


# 197636 30-Sep-2009 rwatson

Reserve system call numbers for Capsicum security framework capabilities,
capability mode, and process descriptors: cap_new, cap_getrights, cap_enter,
cap_getmode, pdfork, pdkill, pdgetpid, and pdwait.

Obtained from: TrustedBSD Project
Sponsored by: Google
MFC after: 3 weeks


# 195458 08-Jul-2009 trasz

There is an optimization in chmod(1), that makes it not to call chmod(2)
if the new file mode is the same as it was before; however, this
optimization must be disabled for filesystems that support NFSv4 ACLs.
Chmod uses pathconf(2) to determine whether this is the case - however,
pathconf(2) always follows symbolic links, while the 'chmod -h' doesn't.

This change adds lpathconf(3) to make it possible to solve that problem
in a clean way.

Reviewed by: rwatson (earlier version)
Approved by: re (kib)


# 194910 24-Jun-2009 jhb

Change the ABI of some of the structures used by the SYSV IPC API:
- The uid/cuid members of struct ipc_perm are now uid_t instead of unsigned
short.
- The gid/cgid members of struct ipc_perm are now gid_t instead of unsigned
short.
- The mode member of struct ipc_perm is now mode_t instead of unsigned short
(this is merely a style bug).
- The rather dubious padding fields for ABI compat with SV/I386 have been
removed from struct msqid_ds and struct semid_ds.
- The shm_segsz member of struct shmid_ds is now a size_t instead of an
int. This removes the need for the shm_bsegsz member in struct
shmid_kernel and should allow for complete support of SYSV SHM regions
>= 2GB.
- The shm_nattch member of struct shmid_ds is now an int instead of a
short.
- The shm_internal member of struct shmid_ds is now gone. The internal
VM object pointer for SHM regions has been moved into struct
shmid_kernel.
- The existing __semctl(), msgctl(), and shmctl() system call entries are
now marked COMPAT7 and new versions of those system calls which support
the new ABI are now present.
- The new system calls are assigned to the FBSD-1.1 version in libc. The
FBSD-1.0 symbols in libc now refer to the old COMPAT7 system calls.
- A simplistic framework for tagging system calls with compatibility
symbol versions has been added to libc. Version tags are added to
system calls by adding an appropriate __sym_compat() entry to
src/lib/libc/incldue/compat.h. [1]

PR: kern/16195 kern/113218 bin/129855
Reviewed by: arch@, rwatson
Discussed with: kan, kib [1]


# 194894 24-Jun-2009 jhb

Deprecate the msgsys(), semsys(), and shmsys() system calls by moving
them under COMPAT_FREEBSD[4567]. Starting with FreeBSD 5.0 the SYSV IPC
API was implemented via direct system calls (e.g. msgctl(), msgget(), etc.)
rather than indirecting through the var-args *sys() system calls. The
shmsys() system call was already effectively deprecated for all but
COMPAT_FREEBSD4 already as its implementation for the !COMPAT_FREEBSD4 case
was to simply invoke nosys().


# 194833 24-Jun-2009 jhb

Add a new COMPAT7 flag for FreeBSD 7.x compatibility system calls.


# 194645 22-Jun-2009 jhb

Fix a typo in a comment.


# 194390 17-Jun-2009 jhb

- Add the ability to mix multiple flags seperated by pipe ('|') characters
in the type field of system call tables. Specifically, one can now use
the 'NO*' types as flags in addition to the 'COMPAT*' types. For example,
to tag 'COMPAT*' system calls as living in a KLD via NOSTD. The COMPAT*
type is required to be listed first in this case.
- Add new functions 'type()' and 'flag()' to the embedded awk script in
makesyscalls.sh that return true if a requested flag is found in the
type field ($3). The flag() function checks all of the flags in the
field, but type() only checks the first flag. type() is meant to be
used in the top-level "switch" statement and flag() should be used
otherwise.
- Retire the CPT_NOA type, it is now replaced with "COMPAT|NOARGS" using
the flags approach.
- Tweak the comment descriptions of COMPAT[46] system calls so that they
say "freebsd[46] foo" rather than "old foo".
- Document the COMPAT6 type.
- Sync comments in compat32 syscall table with the master table.


# 194384 17-Jun-2009 jhb

Remove the now-unused NOIMPL flag. It serves no useful purpose given the
existing UNIMPL and NOSTD types.


# 194383 17-Jun-2009 jhb

- NOSTD results in lkmressys being used instead of lkmssys.
- Mark nfsclnt as UNIMPL. It should have been NOSTD instead of NOIMPL back
when it lived in nfsclient.ko, but it was removed from that a long time
ago.


# 194262 15-Jun-2009 jhb

Add a new 'void closefrom(int lowfd)' system call. When called, it closes
any open file descriptors >= 'lowfd'. It is largely identical to the same
function on other operating systems such as Solaris, DFly, NetBSD, and
OpenBSD. One difference from other *BSD is that this closefrom() does not
fail with any errors. In practice, while the manpages for NetBSD and
OpenBSD claim that they return EINTR, they ignore internal errors from
close() and never return EINTR. DFly does return EINTR, but for the common
use case (closing fd's prior to execve()), the caller really wants all
fd's closed and returning EINTR just forces callers to call closefrom() in
a loop until it stops failing.

Note that this implementation of closefrom(2) does not make any effort to
resolve userland races with open(2) in other threads. As such, it is not
multithread safe.

Submitted by: rwatson (initial version)
Reviewed by: rwatson
MFC after: 2 weeks


# 191673 29-Apr-2009 jamie

Introduce the extensible jail framework, using the same "name=value"
interface as nmount(2). Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
This replaces jail(2).
* jail_get, to read the parameters of existing jails. This replaces the
security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes. The current jail(2) system
call still exists, though it is now a stub to jail_set(2).

Approved by: bz (mentor)


# 184789 09-Nov-2008 ed

Mark uname(), getdomainname() and setdomainname() with COMPAT_FREEBSD4.

Looking at our source code history, it seems the uname(),
getdomainname() and setdomainname() system calls got deprecated
somewhere after FreeBSD 1.1, but they have never been phased out
properly. Because we don't have a COMPAT_FREEBSD1, just use
COMPAT_FREEBSD4.

Also fix the Linuxolator to build without the setdomainname() routine by
just making it call userland_sysctl on kern.domainname. Also replace the
setdomainname()'s implementation to use this approach, because we're
duplicating code with sysctl_domainname().

I wasn't able to keep these three routines working in our
COMPAT_FREEBSD32, because that would require yet another keyword for
syscalls.master (COMPAT4+NOPROTO). Because this routine is probably
unused already, this won't be a problem in practice. If it turns out to
be a problem, we'll just restore this functionality.

Reviewed by: rdivacky, kib


# 184588 03-Nov-2008 dfr

Implement support for RPCSEC_GSS authentication to both the NFS client
and server. This replaces the RPC implementation of the NFS client and
server with the newer RPC implementation originally developed
(actually ported from the userland sunrpc code) to support the NFS
Lock Manager. I have tested this code extensively and I believe it is
stable and that performance is at least equal to the legacy RPC
implementation.

The NFS code currently contains support for both the new RPC
implementation and the older legacy implementation inherited from the
original NFS codebase. The default is to use the new implementation -
add the NFS_LEGACYRPC option to fall back to the old code. When I
merge this support back to RELENG_7, I will probably change this so
that users have to 'opt in' to get the new code.

To use RPCSEC_GSS on either client or server, you must build a kernel
which includes the KGSSAPI option and the crypto device. On the
userland side, you must build at least a new libc, mountd, mount_nfs
and gssd. You must install new versions of /etc/rc.d/gssd and
/etc/rc.d/nfsd and add 'gssd_enable=YES' to /etc/rc.conf.

As long as gssd is running, you should be able to mount an NFS
filesystem from a server that requires RPCSEC_GSS authentication. The
mount itself can happen without any kerberos credentials but all
access to the filesystem will be denied unless the accessing user has
a valid ticket file in the standard place (/tmp/krb5cc_<uid>). There
is currently no support for situations where the ticket file is in a
different place, such as when the user logged in via SSH and has
delegated credentials from that login. This restriction is also
present in Solaris and Linux. In theory, we could improve this in
future, possibly using Brooks Davis' implementation of variant
symlinks.

Supporting RPCSEC_GSS on a server is nearly as simple. You must create
service creds for the server in the form 'nfs/<fqdn>@<REALM>' and
install them in /etc/krb5.keytab. The standard heimdal utility ktutil
makes this fairly easy. After the service creds have been created, you
can add a '-sec=krb5' option to /etc/exports and restart both mountd
and nfsd.

The only other difference an administrator should notice is that nfsd
doesn't fork to create service threads any more. In normal operation,
there will be two nfsd processes, one in userland waiting for TCP
connections and one in the kernel handling requests. The latter
process will create as many kthreads as required - these should be
visible via 'top -H'. The code has some support for varying the number
of service threads according to load but initially at least, nfsd uses
a fixed number of threads according to the value supplied to its '-n'
option.

Sponsored by: Isilon Systems
MFC after: 1 month


# 183361 25-Sep-2008 jhb

Tidy up a few things with syscall generation:
- Instead of using a syscall slot (370) just to get a function prototype
for lkmressys(), add an explicit function prototype to <sys/sysent.h>.
This also removes unused special case checks for 'lkmressys' from
makesyscalls.sh.
- Instead of having magic logic in makesyscalls.sh to only generate a
function prototype the first time 'lkmnosys' is seen, make 'NODEF'
always not generate a function prototype and include an explicit
prototype for 'lkmnosys' in <sys/sysent.h>.
- As a result of the fix in (2), update the LKM syscall entries in
the freebsd32 syscall table to use 'lkmnosys' rather than 'nosys'.
- Use NOPROTO for the __syscall() entry (198) in the native ABI. This
avoids the need for magic logic in makesyscalls.h to only generate
a function prototype the first time 'nosys' is encountered.


# 182123 24-Aug-2008 rwatson

When MPSAFE ttys were merged, a new BSM audit event identifier was
allocated for posix_openpt(2). Unfortunately, that identifier
conflicts with other events already allocated to other systems in
OpenBSM. Assign a new globally unique identifier and conform
better to the AUE_ event naming scheme.

This is a stopgap until a new OpenBSM import is done with the
correct identifier, so we'll maintain this as a local diff in svn
until then.

Discussed with: ed
Obtained from: TrustedBSD Project


# 181972 21-Aug-2008 obrien

Add comments on NOARGS, NODEF, and NOPROTO.


# 181905 20-Aug-2008 ed

Integrate the new MPSAFE TTY layer to the FreeBSD operating system.

The last half year I've been working on a replacement TTY layer for the
FreeBSD kernel. The new TTY layer was designed to improve the following:

- Improved driver model:

The old TTY layer has a driver model that is not abstract enough to
make it friendly to use. A good example is the output path, where the
device drivers directly access the output buffers. This means that an
in-kernel PPP implementation must always convert network buffers into
TTY buffers.

If a PPP implementation would be built on top of the new TTY layer
(still needs a hooks layer, though), it would allow the PPP
implementation to directly hand the data to the TTY driver.

- Improved hotplugging:

With the old TTY layer, it isn't entirely safe to destroy TTY's from
the system. This implementation has a two-step destructing design,
where the driver first abandons the TTY. After all threads have left
the TTY, the TTY layer calls a routine in the driver, which can be
used to free resources (unit numbers, etc).

The pts(4) driver also implements this feature, which means
posix_openpt() will now return PTY's that are created on the fly.

- Improved performance:

One of the major improvements is the per-TTY mutex, which is expected
to improve scalability when compared to the old Giant locking.
Another change is the unbuffered copying to userspace, which is both
used on TTY device nodes and PTY masters.

Upgrading should be quite straightforward. Unlike previous versions,
existing kernel configuration files do not need to be changed, except
when they reference device drivers that are listed in UPDATING.

Obtained from: //depot/projects/mpsafetty/...
Approved by: philip (ex-mentor)
Discussed: on the lists, at BSDCan, at the DevSummit
Sponsored by: Snow B.V., the Netherlands
dcons(4) fixed by: kan


# 178888 09-May-2008 julian

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# 177788 31-Mar-2008 kib

Add the openat(), fexecve() and other *at() syscalls to the table.

Based on the submission by rdivacky,
sponsored by Google Summer of Code 2007
Reviewed by: rwatson, rdivacky
Tested by: pho


# 177633 26-Mar-2008 dfr

Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
client handle safely with replies being de-multiplexed at the socket
upcall (typically driven directly by the NIC interrupt) and handed
off to whichever thread matches the reply. For UDP sockets, many RPC
clients can share the same socket. This allows the use of a single
privileged UDP port number to talk to an arbitrary number of remote
hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
server would be relatively straightforward and would follow
approximately the Solaris KPI. A single thread should be sufficient
for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
callbacks. I've tested the NLM server reasonably extensively - it
passes both my own tests and the NFS Connectathon locking tests
running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
support for the local NFS client's locking needs, it does have to
field async replies and granted callbacks from remote NLMs that the
local client has contacted. We relay these replies to the userland
rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
it will detect deadlocks caused by a lock request that covers more
than one blocking request. As required by the NLM protocol, all
deadlock detection happens synchronously - a user is guaranteed that
if a lock request isn't rejected immediately, the lock will
eventually be granted. The old system allowed for a 'deferred
deadlock' condition where a blocked lock request could wake up and
find that some other deadlock-causing lock owner had beaten them to
the lock.

* Since both local and remote locks are managed by the same kernel
locking code, local and remote processes can safely use file locks
for mutual exclusion. Local processes have no fairness advantage
compared to remote processes when contending to lock a region that
has just been unlocked - the local lock manager enforces a strict
first-come first-served model for both local and remote lockers.

Sponsored by: Isilon Systems
PR: 95247 107555 115524 116679
MFC after: 2 weeks


# 177597 25-Mar-2008 ru

Fixed type of the fourth argument of cpuset_{get,set}affinity(2) to be size_t.

Prodded by: davidxu


# 177091 12-Mar-2008 jeff

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# 176730 02-Mar-2008 jeff

Add cpuset, an api for thread to cpu binding and cpu resource grouping
and assignment.
- Add a reference to a struct cpuset in each thread that is inherited from
the thread that created it.
- Release the reference when the thread is destroyed.
- Add prototypes for syscalls and macros for manipulating cpusets in
sys/cpuset.h
- Add syscalls to create, get, and set new numbered cpusets:
cpuset(), cpuset_{get,set}id()
- Add syscalls for getting and setting affinity masks for cpusets or
individual threads: cpuid_{get,set}affinity()
- Add types for the 'level' and 'which' parameters for the cpuset. This
will permit expansion of the api to cover cpu masks for other objects
identifiable with an id_t integer. For example, IRQs and Jails may be
coming soon.
- The root set 0 contains all valid cpus. All thread initially belong to
cpuset 1. This permits migrating all threads off of certain cpus to
reserve them for special applications.

Sponsored by: Nokia
Discussed with: arch, rwatson, brooks, davidxu, deischen
Reviewed by: antoine


# 176215 12-Feb-2008 ru

Change readlink(2)'s return type and type of the last argument
to match POSIX.

Prodded by: Alexey Lyashkov


# 175517 20-Jan-2008 rwatson

Use audit events AUE_SHMOPEN and AUE_SHMUNLINK with new system calls
shm_open() and shm_unlink(). More auditing will need to be done for
these calls to capture arguments properly.


# 175164 08-Jan-2008 jhb

Add a new file descriptor type for IPC shared memory objects and use it to
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
object which provides the backing store. Each descriptor starts off with
a size of zero, but the size can be altered via ftruncate(2). The shared
memory file descriptors also support fstat(2). read(2), write(2),
ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
manage shared memory file descriptors. The virtual namespace that maps
pathnames to shared memory file descriptors is implemented as a hash
table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
path argument to shm_open(2). In this case, an unnamed shared memory
file descriptor will be created similar to the IPC_PRIVATE key for
shmget(2). Note that the shared memory object can still be shared among
processes by sharing the file descriptor via fork(2) or sendmsg(2), but
it is unnamed. This effectively serves to implement the getmemfd() idea
bandied about the lists several times over the years.
- The backing store for shared memory file descriptors are garbage
collected when they are not referenced by any open file descriptors or
the shm_open(2) virtual namespace.

Submitted by: dillon, peter (previous versions)
Submitted by: rwatson (I based this on his version)
Reviewed by: alc (suggested converting getmemfd() to shm_open())


# 172819 19-Oct-2007 emaste

Put comments about syscalls by the correct ones, and use the correct syscall
number in the comment.


# 171859 16-Aug-2007 davidxu

Add thr_kill2 syscall which sends a signal to a thread in another process.

Submitted by: Tijl Coosemans tijl at ulyssis dot org
Approved by: re (kensmith)


# 171209 04-Jul-2007 peter

Create new syscalls for mmap(), lseek(), pread(), pwrite(), truncate() and
ftruncate(), but without the pad arg.

There are several reasons for this. Consider 'mmap()'. On AMD64, the
function call (and syscall) ABI allow for 6 register arguments. Additional
arguments go on the stack. mmap(2) has 6 arguments. However, the syscall
definition has an extra 'int pad' argument. This pushes it to 7 arguments,
which means one must spill into the memory stack. Since the kernel API
doesn't match userland API, we have a hack in libc - libc/sys/mmap.c.
This implements the userland API by calling __syscall() with an extra
argument and the pad argument, for a total of 8 args. This is all
unnecessary and inconvenient for several things, including the kernel's
syscall handler code which now has to handle merging stack arguments with
register arguments. It is a big deal for certain 3rd party code.

I'm adding libc glue to make the transition totally painless. I had
intended to mark the old syscalls as COMPAT6, but the potential to shoot
your feet by building a new kernel without COMPAT_FREEBSD6 but with a
slighly older userland was too great. For now, they have manual
"freebsd6_" prefixes rather than being COMPAT6. They will go back to
being marked 'COMPAT6' after 7-stable starts.

Approved by: re (kensmith)


# 163953 03-Nov-2006 rrs

Ok, here it is, we finally add SCTP to current. Note that this
work is not just mine, but it is also the works of Peter Lei
and Michael Tuexen. They both are my two key other developers
working on the project.. and they need ata-boy's too:
****
peterlei@cisco.com
tuexen@fh-muenster.de
****
I did do a make sysent which updated the
syscall's and sysproto.. I hope that is correct... without
it you don't build since we have new syscalls for SCTP :-0

So go out and look at the NOTES, add
option SCTP (make sure inet and inet6 are present too)
and play with SCTP.

I will see about comitting some test tools I have after I
figure out where I should place them. I also have a
lib (libsctp.a) that adds some of the missing socketapi
functions that I need to put into lib's.. I will talk
to George about this :-)

There may still be some 64 bit issues in here, none of
us have a 64 bit processor to test with yet.. Michael
may have a MAC but thats another beast too..

If you have a mac and want to use SCTP contact Michael
he maintains a web site with a loadable module with
this code :-)

Reviewed by: gnn
Approved by: gnn


# 163449 17-Oct-2006 davidxu

o Add keyword volatile for user mutex owner field.
o Fix type consistent problem by using type long for old
umtx and wait channel.
o Rename casuptr to casuword.


# 162991 03-Oct-2006 rwatson

Audit creat() system call (compat code), and change type for getpagesize(),
which isn't actually being audited anyway.

MFC after: 3 days
Obtained from: TrustedBSD Project


# 162497 21-Sep-2006 davidxu

Replace system call thr_getscheduler, thr_setscheduler, thr_setschedparam
with rtprio_thread, while rtprio system call is for process only, the new
system call rtprio_thread is responsible for LWP.


# 162373 17-Sep-2006 rwatson

AUE_SIGALTSTACK instead of AUE_SIGPENDING for sigaltstack().

Obtained from: TrustedBSD Project
MFC after: 3 days


# 161952 03-Sep-2006 rwatson

Assign proper audit event identifiers to a number of system calls not
covered in previous passes:

- sysarch, rtprio
- clock_settime
- preadv/pwritev
- __getcwd
- kqueue
- fhstatfs
- kldunloadf

Obtained from: TrustedBSD Project


# 161946 03-Sep-2006 rwatson

Use AUE_NTP_ADJTIME for ntp_adjtime() instead of AUE_ADJTIME.

Obtained from: TrustedBSD Project


# 161678 28-Aug-2006 davidxu

This is initial version of POSIX priority mutex support, a new userland
mutex structure is added as following:
struct umutex {
__lwpid_t m_owner;
uint32_t m_flags;
uint32_t m_ceilings[2];
uint32_t m_spare[4];
};
The m_owner represents owner thread, it is a thread id, in non-contested
case, userland can simply use atomic_cmpset_int to lock the mutex, if the
mutex is contested, high order bit will be set, and userland should do locking
and unlocking via kernel syscall. Flag UMUTEX_PRIO_INHERIT represents
pthread's PTHREAD_PRIO_INHERIT mutex, which when contention happens, kernel
should do priority propagating. Flag UMUTEX_PRIO_PROTECT indicates it is
pthread's PTHREAD_PRIO_PROTECT mutex, userland should initialize m_owner
to contested state UMUTEX_CONTESTED, then atomic_cmpset_int will be failure
and kernel syscall should be invoked to do locking, this becauses
for such a mutex, kernel should always boost the thread's priority before
it can lock the mutex, m_ceilings is used by PTHREAD_PRIO_PROTECT mutex,
the first element is used to boost thread's priority when it locked the mutex,
second element is used when the mutex is unlocked, the PTHREAD_PRIO_PROTECT
mutex's link list is kept in userland, the m_ceiling[1] is managed by thread
library so kernel needn't allocate memory to keep the link list, when such
a mutex is unlocked, kernel reset m_owner to UMUTEX_CONTESTED.
Flag USYNC_PROCESS_SHARED indicate if the synchronization object is process
shared, if the flag is not set, it saves a vm_map_lookup() call.

The umtx chain is still used as a sleep queue, when a thread is blocked on
PTHREAD_PRIO_INHERIT mutex, a umtx_pi is allocated to support priority
propagating, it is dynamically allocated and reference count is used,
it is not optimized but works well in my tests, while the umtx chain has
its own locking protocol, the priority propagating protocol are all protected
by sched_lock because priority propagating function is called with sched_lock
held from scheduler.

No visible performance degradation is found which these changes. Some parameter
names in _umtx_op syscall are renamed.


# 161367 16-Aug-2006 peter

Grab two syscall numbers. One is used to emulate functionality that linux
has in its procfs (do a readlink of /proc/self/fd/<nn> to find the pathname
that corresponds to a given file descriptor). Valgrind-3.x needs this
functionality. This is a placeholder only at this time.


# 161325 15-Aug-2006 jhb

- Use NOSTD rather than NOIMPL for nfssvc() to match other syscalls
provided via klds.
- Correct audit identifier for nfssvc().


# 160798 28-Jul-2006 jhb

Now that all system calls are MPSAFE, retire the SYF_MPSAFE flag used to
mark system calls as being MPSAFE:
- Stop conditionally acquiring Giant around system call invocations.
- Remove all of the 'M' prefixes from the master system call files.
- Remove support for the 'M' prefix from the script that generates the
syscall-related files from the master system call files.
- Don't explicitly set SYF_MPSAFE when registering nfssvc.


# 160797 28-Jul-2006 jhb

Various fixes to comments in the syscall master files including removing
cruft from the audit import and adding mention of COMPAT4 to freebsd32.


# 160319 13-Jul-2006 davidxu

Add syscalls thr_setscheduler, thr_getscheduler, and thr_setschedparam,
these syscalls are designed to set thread's scheduling parameters and
policy, because each syscall contains a size parameter, it is possible
to support future scheduling option, e.g SCHED_SPORADIC, this option
needs other fields in structure sched_param, current they are not
avaiblable.


# 160276 11-Jul-2006 jhb

- Add conditional VFS Giant locking to getdents_common() (linux ABIs),
ibcs2_getdents(), ibcs2_read(), ogetdirentries(), svr4_sys_getdents(),
and svr4_sys_getdents64() similar to that in getdirentries().
- Mark ibcs2_getdents(), ibcs2_read(), linux_getdents(), linux_getdents64(),
linux_readdir(), ogetdirentries(), svr4_sys_getdents(), and
svr4_sys_getdents64() MPSAFE.


# 160111 05-Jul-2006 wsalamon

Add audit events for the extended attribute system calls.

Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)


# 159982 27-Jun-2006 jhb

- Expand the scope of Giant some in mount(2) to protect the vfsp structure
from going away. mount(2) is now MPSAFE.
- Expand the scope of Giant some in unmount(2) to protect the mp structure
(or rather, to handle concurrent unmount races) from going away.
umount(2) is now MPSAFE, as well as linux_umount() and linux_oldumount().
- nmount(2) and linux_mount() were already MPSAFE.


# 157211 28-Mar-2006 des

Revert previous commit at davidxu's insistance. Instead, use __DECONST
(argh!) and rearrange the prototypes to make it clear that _umtx_op()
is not deprecated.


# 157206 28-Mar-2006 des

The undocumented and deprecated system call _umtx_op() takes two pointer
arguments. The first one is never used (all callers pass in 0); the
second is sometimes used to pass in a struct timespec * which is used as
a timeout and never modified. Constify that argument so callers can pass
a const struct timespec * without jumping through hoops.


# 157037 23-Mar-2006 davidxu

Implement aio_fsync() syscall.


# 156134 01-Mar-2006 davidxu

Let kernel POSIX timer code and mqueue code to use integer as a resource
handle, the timer_t and mqd_t types will be a pointer which userland
will define it.


# 155377 06-Feb-2006 rwatson

Prefer AUE_FOO audit identifiers to AUE_O_FOO, which are largely left
over from the Darwin implementation.

When we implement a system call as a wrapper to sysctl(), audit it as
AUE_SYSCTL. This leads to greater compatibility with Solaris audit
trails as sysctl() argument tokens are not the same as the ones for
the originaly system calls (i.e., setdomainname()).

Replace references to AUE_ events that are equivilent to AUE_NULL with
AUE_NULL. In the case of process signal configuration, this is
because these events do not require auditing.

Move from the Darwin spelling of getsockopt() to the FreeBSD/Solaris
one.

Audit nmount().

Obtained from: TrustedBSD Project


# 155327 05-Feb-2006 davidxu

Implement thr_set_name to set a name for thread.

Reviewed by: julian


# 155249 03-Feb-2006 rwatson

Assign audit event identifiers to many system calls.

Much work by: wsalamon
Obtained from: TrustedBSD Project


# 155199 01-Feb-2006 rwatson

Map audit-related system calls to audit event identifiers.

Much work by: wsalamon
Obtained from: TrustedBSD Project


# 154669 22-Jan-2006 davidxu

Make aio code MP safe.


# 153679 23-Dec-2005 phk

Add abort2() systemcall.


# 152845 26-Nov-2005 davidxu

Don't use OpenBSD syscall numbers, instead, use new syscall numbers
for POSIX message queue.

Suggested by: rwatson


# 152825 26-Nov-2005 davidxu

Bring in experimental kernel support for POSIX message queue.


# 151867 30-Oct-2005 davidxu

Fix sigevent's POSIX incompatible problem by adding member fields
sigev_notify_function and sigev_notify_attributes. AIO syscalls
use sigevent, so they have to be adjusted.

Reviewed by: alc


# 151576 23-Oct-2005 davidxu

Implement POSIX timers. Current only CLOCK_REALTIME and CLOCK_MONOTONIC
clock are supported. I have plan to merge XSI timer ITIMER_REAL and other
two CPU timers into the new code, current three slots are available for
the XSI timers.
The SIGEV_THREAD notification type is not supported yet because our
sigevent struct lacks of two member fields:
sigev_notify_function
sigev_notify_attributes
I have found the sigevent is used in AIO, so I won't add the two members
unless the AIO code is adjusted.


# 151445 18-Oct-2005 stefanf

Const-qualify ksem_timedwait's parameter abstime as it's only passed in.


# 151316 14-Oct-2005 davidxu

1. Change prototype of trapsignal and sendsig to use ksiginfo_t *, most
changes in MD code are trivial, before this change, trapsignal and
sendsig use discrete parameters, now they uses member fields of
ksiginfo_t structure. For sendsig, this change allows us to pass
POSIX realtime signal value to user code.

2. Remove cpu_thread_siginfo, it is no longer needed because we now always
generate ksiginfo_t data and feed it to libpthread.

3. Add p_sigqueue to proc structure to hold shared signals which were
blocked by all threads in the proc.

4. Add td_sigqueue to thread structure to hold all signals delivered to
thread.

5. i386 and amd64 now return POSIX standard si_code, other arches will
be fixed.

6. In this sigqueue implementation, pending signal set is kept as before,
an extra siginfo list holds additional siginfo_t data for signals.
kernel code uses psignal() still behavior as before, it won't be failed
even under memory pressure, only exception is when deleting a signal,
we should call sigqueue_delete to remove signal from sigqueue but
not SIGDELSET. Current there is no kernel code will deliver a signal
with additional data, so kernel should be as stable as before,
a ksiginfo can carry more information, for example, allow signal to
be delivered but throw away siginfo data if memory is not enough.
SIGKILL and SIGSTOP have fast path in sigqueue_add, because they can
not be caught or masked.
The sigqueue() syscall allows user code to queue a signal to target
process, if resource is unavailable, EAGAIN will be returned as
specification said.
Just before thread exits, signal queue memory will be freed by
sigqueue_flush.
Current, all signals are allowed to be queued, not only realtime signals.

Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64


# 150619 27-Sep-2005 csjp

Mark the extended attribute syscalls as being MP safe.

Requested by: jhb


# 147831 08-Jul-2005 jhb

Mark second instance of lchown() MP safe just like the first.

Approved by: re (scottl)


# 147813 07-Jul-2005 jhb

- Add two new system calls: preadv() and pwritev() which are like readv()
and writev() except that they take an additional offset argument and do
not change the current file position. In SAT speak:
preadv:readv::pread:read and pwritev:writev::pwrite:write.
- Try to reduce code duplication some by merging most of the old
kern_foov() and dofilefoo() functions into new dofilefoo() functions
that are called by kern_foov() and kern_pfoov(). The non-v functions
now all generate a simple uio on the stack from the passed in arguments
and then call kern_foov(). For example, read() now just builds a uio and
calls kern_readv() and pwrite() just builds a uio and calls kern_pwritev().

PR: kern/80362
Submitted by: Marc Olzheim marcolz at stack dot nl (1)
Approved by: re (scottl)
MFC after: 1 week


# 146806 30-May-2005 rwatson

Introduce a new field in the syscalls.master file format to hold the
audit event identifier associated with each system call, which will
be stored by makesyscalls.sh in the sy_auevent field of struct sysent.
For now, default the audit identifier on all system calls to AUE_NULL,
but in the near future, other BSM event identifiers will be used. The
mapping of system calls to event identifiers is many:one due to
multiple system calls that map to the same end functionality across
compatibility wrappers, ABI wrappers, etc.

Submitted by: wsalamon
Obtained from: TrustedBSD Project


# 146783 29-May-2005 rwatson

Normalize white space in syscalls.master: try to use tabs before system
call types.


# 146723 28-May-2005 rwatson

Mark ntp_gettime() as MSTD, since its system call path will acquire
Giant if required.


# 146719 28-May-2005 rwatson

Mark the following compatability system calls as MCOMPAT or MCOMPAT4 based
on the their simply wrapping MPSAFE implementations of existing MPSAFE
system calls:

getfsstat()
lseek()
stat()
lstat()
truncate()
ftruncate()
statfs()
fstatfs()

Note that ogetdirentries() is not marked MPSAFE because it does not share
the MPSAFE implementation used for getdirentries(), and requires separate
locking to be implemented.


# 146716 28-May-2005 rwatson

Mark quotactl() as MSTD.


# 146713 28-May-2005 rwatson

Mark kenv(2) as MPSAFE, since it appears to be properly locked down.


# 146711 28-May-2005 rwatson

Also mark the COMPAT4 version of fhstatfs() as MPSAFE.


# 146710 28-May-2005 rwatson

Mark fhopen(), fhstat(), and fhstatfs() as MSTD, since they now
acquire Giant themselves.


# 145434 23-Apr-2005 davidxu

Add new syscall thr_new to create thread in atomic, it will
inherit signal mask from parent thread, setup TLS and stack, and
user entry address.
Also support POSIX thread's PTHREAD_SCOPE_PROCESS and PTHREAD_SCOPE_SYSTEM,
sysctl is also provided to control the scheduler scope.


# 143317 09-Mar-2005 stefanf

Fix typo in comment.


# 142932 01-Mar-2005 ps

Change the prototype of kevent to remove the const from the changelist.

Reviewed by: jhb


# 140840 26-Jan-2005 jeff

- Struct mount is not yet locked well enough to allow
mount/nmount/unmount to run without Giant. Mark them as STD here.


# 140724 24-Jan-2005 jeff

- Change all VFS syscalls to MSTD as they all manually deal with giant
or the appropriate filesystem locks.

Sponsored By: Isilon Systems, Inc.


# 139598 02-Jan-2005 marcel

uuidgen(2) is MP safe.


# 139292 25-Dec-2004 davidxu

Make _umtx_op() as more general interface, the final parameter needn't be
timespec pointer, every parameter will be interpreted by its opcode.


# 139013 18-Dec-2004 davidxu

1. make umtx sharable between processes, the way is two or more processes
call mmap() to create a shared space, and then initialize umtx on it,
after that, each thread in different processes can use the umtx same
as threads in same process.
2. introduce a new syscall _umtx_op to support timed lock and condition
variable semantics. also, orignal umtx_lock and umtx_unlock inline
functions now are reimplemented by using _umtx_op, the _umtx_op can
use arbitrary id not just a thread id.


# 138088 25-Nov-2004 phk

Mark mount, unmount and nmount MPSAFE


# 137874 18-Nov-2004 marks

Add ntp_gettime(2) system call.

Reviewed by: imp, phk, njl, peter
Approved by: njl


# 136830 23-Oct-2004 rwatson

Add system call place-holders for the following system calls
implementing Sun's BSM Audit API on FreeBSD:

audit()
auditon()
getauid()
setauid()
getaudit()
setaudit()
getaudit_addr()
setaudit_addr()
auditctl()

Submitted by: Wayne Salamon <wsalamon at computer dot org>
Obtained from: TrustedBSD Project


# 136207 06-Oct-2004 davidxu

Regen to unbreak world.

Pointy hat to: mtm


# 132116 13-Jul-2004 phk

Add kldunloadf() system call. Stay tuned for follwing commit messages.


# 132020 12-Jul-2004 davidxu

Change kse_switchin to accept kse_thr_mailbox pointer, the syscall
will be used heavily in debugging KSE threads. This breaks libpthread
on IA64, but because libpthread was not in 5.2.1 release, I would like
to change it so we needn't to introduce another syscall.


# 131431 01-Jul-2004 marcel

Change the thread ID (thr_id_t) used for 1:1 threading from being a
pointer to the corresponding struct thread to the thread ID (lwpid_t)
assigned to that thread. The primary reason for this change is that
libthr now internally uses the same ID as the debugger and the kernel
when referencing to a kernel thread. This allows us to implement the
support for debugging without additional translations and/or mappings.

To preserve the ABI, the 1:1 threading syscalls, including the umtx
locking API have not been changed to work on a lwpid_t. Instead the
1:1 threading syscalls operate on long and the umtx locking API has
not been changed except for the contested bit. Previously this was
the least significant bit. Now it's the most significant bit. Since
the contested bit should not be tested by userland, this change is
not expected to be visible. Just to be sure, UMTX_CONTESTED has been
removed from <sys/umtx.h>.

Reviewed by: mtm@
ABI preservation tested on: i386, ia64


# 130907 22-Jun-2004 rwatson

Mark unlink() as MPSAFE as we now acquire Giant in the unlink()
system call.


# 130904 22-Jun-2004 rwatson

Mark link() system call as MPSAFE.


# 127890 05-Apr-2004 dfr

Add lgetfh(2) which is like getfh(2) but doesn't follow symlinks.


# 127482 27-Mar-2004 mtm

Separate thread synchronization from signals in libthr. Instead
use msleep() and wakeup_one().

Discussed with: jhb, peter, tjr


# 127061 16-Mar-2004 dwmalone

Get ready to mark open, creat and nosys as MPSAFE.


# 127034 15-Mar-2004 jhb

Drop the proc lock around calls to the MD functions ptrace_single_step(),
ptrace_set_pc(), and cpu_ptrace() so that those functions are free to
acquire Giant, sleep, etc. We already do a PHOLD/PRELE around them so
that it is safe to sleep inside of these routines if necessary. This
allows ptrace() to be marked MP safe again as it no longer triggers lock
order reversals on Alpha.

Tested by: wilko


# 126932 13-Mar-2004 peter

Push Giant down a little further:
- no longer serialize on Giant for thread_single*() and family in fork,
exit and exec
- thread_wait() is mpsafe, assert no Giant
- reduce scope of Giant in exit to not cover thread_wait and just do
vm_waitproc().
- assert that thread_single() family are not called with Giant
- remove the DROP/PICKUP_GIANT macros from thread_single() family
- assert that thread_suspend_check() s not called with Giant
- remove manual drop_giant hack in thread_suspend_check since we know it
isn't held.
- remove the DROP/PICKUP_GIANT macros from thread_suspend_check() family
- mark kse_create() mpsafe


# 125368 03-Feb-2004 deischen

Add ksem_timedwait() to complement ksem_wait().

Glanced at by: alfred


# 123853 26-Dec-2003 alfred

Put restrict back in, the compilation failure was my fault when I
did a bad merge from the PR.

Thanks to Bruce Evans for explaining.


# 123817 24-Dec-2003 alfred

We're not ready for restrict qualifiers here.


# 123811 24-Dec-2003 alfred

Add restrict qualifiers.

PR: 44394
Submitted by: Craig Rodrigues <rodrige@attbi.com>


# 123750 23-Dec-2003 peter

Remove namespc column and attempt to un-fold some of the longer lines
that now fit.


# 123412 10-Dec-2003 peter

Previous commit also changed the sendmsg prototype to something more
closely matching reality. I did not actually mean to commit that yet.


# 123408 10-Dec-2003 peter

Update file locations for syscall tables to copy to.


# 123252 07-Dec-2003 marcel

Add kse_switchin(2). This syscall can be used by KSE implementations
to have the kernel switch to a new thread, instead of doing it in
userland. It is in fact needed on ia64 where syscall restarts do not
return to userland first. It's completely handled inside the kernel.
As such, any context created by the kernel as part of an upcall and
caused by some syscall needs to be restored by the kernel.


# 122635 14-Nov-2003 jeff

- Revision 1.156 marked ptrace() SMP safe. Unfortunately, alpha implements
parts of ptrace using proc_rwmem(). proc_rwmem() requires giant, and
giant must be acquired prior to the proc lock, so ptrace must require giant
still.


# 122537 12-Nov-2003 mckusick

Update the statfs structure with 64-bit fields to allow
accurate reporting of multi-terabyte filesystem sizes.

You should build and boot a new kernel BEFORE doing a `make world'
as the new kernel will know about binaries using the old statfs
structure, but an old kernel will not know about the new system
calls that support the new statfs structure. Running an old kernel
after a `make world' will cause programs such as `df' that do a
statfs system call to fail with a bad system call.

Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Tim Robbins <tjr@freebsd.org>
Reviewed by: Julian Elischer <julian@elischer.org>
Reviewed by: the hoards of <arch@freebsd.org>
Sponsored by: DARPA & NAI Labs.


# 122241 07-Nov-2003 jhb

Mark ptrace(), ktrace(), utrace(), sysarch(), and issetugid() as MP safe.
The parts of these calls that are not yet MP safe acquire Giant explicitly.


# 121298 21-Oct-2003 scottl

Don peril-sensitive sunglasses and mark pipe(2) as MPSAFE. I've beaten up
on it for the last 15 hours with no signs of problems. It gives a small
(1%) gain on buildworld since pipe_read/pipe_write are already free of Giant.


# 121284 20-Oct-2003 dwmalone

Mark dup as MPSAFE. Giant was pushed into dup ages ago, but it looks
like it was missed in syscalls.master.

Spotted by: alc


# 119827 07-Sep-2003 alc

msync(2) should be declared MP-safe.


# 117704 17-Jul-2003 davidxu

o Refine kse_thr_interrupt to allow it to handle different commands.
o Remove TDF_NOSIGPOST.
o Add a member td_waitset to proc structure, it will be used for sigwait.

Tested by: deischen


# 116963 28-Jun-2003 davidxu

o Change kse_thr_interrupt to allow send a signal to a specified thread,
or unblock a thread in kernel, and allow UTS to specify whether syscall
should be restarted.
o Add ability for UTS to monitor signal comes in and removed from process,
the flag PS_SIGEVENT is used to indicate the events.
o Add a KMF_WAITSIGEVENT for KSE mailbox flag, UTS call kse_release with
this flag set to wait for above signal event.
o For SA based thread, kernel masks all signal in its signal mask, let
UTS to use kse_thr_interrupt interrupt a thread, and install a signal
frame in userland for the thread.
o Add a tm_syncsig in thread mailbox, when a hardware trap occurs,
it is used to deliver synchronous signal to userland, and upcall
is schedule, so UTS can process the synchronous signal for the thread.

Reviewed by: julian (mentor)


# 115799 04-Jun-2003 rwatson

Add system calls to explicitly list extended attributes on a
file/directory/link, rather than using a less explicit hack on
the extattr retrieval API:

extattr_list_fd()
extattr_list_file()
extattr_list_link()

The existing API was counter-intuitive, and poorly documented.
The prototypes for these system calls are identical to
extattr_get_*(), but without a specific attribute name to
leave NULL.

Pointed out by: Dominic Giampaolo <dbg@apple.com>
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 113275 09-Apr-2003 mike

o In struct prison, add an allprison linked list of prisons (protected
by allprison_mtx), a unique prison/jail identifier field, two path
fields (pr_path for reporting and pr_root vnode instance) to store
the chroot() point of each jail.
o Add jail_attach(2) to allow a process to bind to an existing jail.
o Add change_root() to perform the chroot operation on a specified
vnode.
o Generalize change_dir() to accept a vnode, and move namei() calls
to callers of change_dir().
o Add a new sysctl (security.jail.list) which is a group of
struct xprison instances that represent a snapshot of active jails.

Reviewed by: rwatson, tjr


# 112911 01-Apr-2003 jeff

- Mark the various thr syscalls as MP safe. Previously there was a bug if
this was not done since thr_exit() unwinds giant.


# 112906 31-Mar-2003 jeff

- Include umtx.h in files generated by makesyscalls.sh
- Add system calls for umtx.


# 112901 31-Mar-2003 jeff

- Add the four thr related system calls.


# 112893 31-Mar-2003 jeff

- Define sigwait, sigtimedwait, and sigwaitinfo in terms of
kern_sigtimedwait() which is capable of supporting all of their semantics.
- These should be POSIX compliant but more careful review is needed before
we announce this.


# 111169 20-Feb-2003 davidxu

Add a timeout parameter to kse_release.


# 109895 26-Jan-2003 alfred

Add const qualifier to data argument for msgsnd.

PR: standards/45274
Submitted by: Craig Rodrigues <rodrigc@attbi.com>


# 109831 25-Jan-2003 alfred

Bring shm functions closer the the opengroup standards.

PR: 47469
Submitted by: Craig Rodrigues <rodrigc@attbi.com>


# 109829 25-Jan-2003 alfred

Bring semop() closer the the opengroup standards.

PR: 47471
Submitted by: Craig Rodrigues <rodrigc@attbi.com>


# 108659 04-Jan-2003 davidxu

Some KSE syscalls are MPSAFE.


# 108405 29-Dec-2002 rwatson

Add definitions for four new system calls:

__acl_get_link() Retrieve an ACL by name without following
symbolic links.
__acl_set_link() Set an ACL by name without following
symbolic links.
__acl_delete_link() Delete an ACL by name without following
symbolic links.
__acl_aclcheck_link() Check an ACL against a file by name without
following symbolic links.

These calls are similar in spirit to lstat(), lchown(), lchmod(), etc,
and will be used under similar circumstances.

Obtained from: TrustedBSD Project


# 107913 15-Dec-2002 dillon

This is David Schultz's swapoff code which I am finally able to commit.
This should be considered highly experimental for the moment.

Submitted by: David Schultz <dschultz@uclink.Berkeley.EDU>
MFC after: 3 weeks


# 106977 16-Nov-2002 deischen

Add getcontext, setcontext, and swapcontext as system calls.
Previously these were libc functions but were requested to
be made into system calls for atomicity and to coalesce what
might be two entrances into the kernel (signal mask setting
and floating point trap) into one.

A few style nits and comments from bde are also included.

Tested on alpha by: gallatin


# 106466 05-Nov-2002 rwatson

Flesh out the definition of __mac_execve(): per earlier discussion,
it's essentially execve() with an optional MAC label argument.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 106312 01-Nov-2002 rwatson

Rename __execve_mac() to __mac_execve() for increased consistency
with other MAC system calls.

Requested by: various (phk, gordont, jake, ...)


# 105950 25-Oct-2002 peter

Split 4.x and 5.x signal handling so that we can keep 4.x signal
handling clean and functional as 5.x evolves. This allows some of the
nasty bandaids in the 5.x codepaths to be unwound.

Encapsulate 4.x signal handling under COMPAT_FREEBSD4 (there is an
anti-foot-shooting measure in place, 5.x folks need this for a while) and
finish encapsulating the older stuff under COMPAT_43. Since the ancient
stuff is required on alpha (longjmp(3) passes a 'struct osigcontext *'
to the current sigreturn(2), instead of the 'ucontext_t *' that sigreturn
is supposed to take), add a compile time check to prevent foot shooting
there too. Add uniform COMPAT_43 stubs for ia64/sparc64/powerpc.

Tested on: i386, alpha, ia64. Compiled on sparc64 (a few days ago).
Approved by: re


# 105691 22-Oct-2002 rwatson

Flesh out prototypes for __mac_get_pid, __mac_get_link, and
__mac_set_link, based on __mac_get_proc() except with a pid,
and __mac_get_file(), __mac_set_file() except that they do
not follow symlinks. First in a series of commits to flesh
out the user API.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 105490 19-Oct-2002 peter

Stake a claim on 418 (__xstat), 419 (__xfstat), 420 (__xlstat)


# 105486 19-Oct-2002 peter

Grab 416/417 real estate before I get burned while testing again.
This is for the not-quite-ready signal/fpu abi stuff. It may not see
the light of day, but I'm certainly not going to be able to validate it
when getting shot in the foot due to syscall number conflicts.


# 105476 19-Oct-2002 rwatson

Add a placeholder for the execve_mac() system call, similar to SELinux's
execve_secure() system call, which permits a process to pass in a label
for a label change during exec. This permits SELinux to change the
label for the resulting exec without a race following a manual label
change on the process. Because this interface uses our general purpose
MAC label abstraction, we call it execve_mac(), and wrap our port of
SELinux's execve_secure() around it with appropriate sid mappings.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 105144 14-Oct-2002 peter

Restore pointer that was removed in 1.128. This wasn't a merge-o.


# 104747 10-Oct-2002 rwatson

Fix what looks like a merge-o from a conflict in the last commit to
syscalls.master.


# 104734 09-Oct-2002 peter

Add a pointer to the alternate syscall tables on 64 bit platforms.


# 104730 09-Oct-2002 rwatson

Flesh out the extattr_{delete,get,set}_link() system calls: variations
on the _file() theme that do not follow symlinks. Sync to MAC tree.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 104379 02-Oct-2002 archie

Let kse_wakeup() take a KSE mailbox pointer argument.

Reviewed by: julian


# 104262 01-Oct-2002 rwatson

Reserve system call numbers for the following system calls:

__mac_get_pid Retrieve MAC label of a process by pid

Similar to __mac_get_proc() except that the target process of
the operation is explicitly specified rather than assuming
curthread.

__mac_get_link Retrieve MAC label of a path with NOFOLLOW
__mac_set_link Set MAC label of a path with NOFOLLOW
extattr_set_link Set EAs on a path with NOFOLLOW
extattr_get_link Retrieve EAs on a path with NOFOLLOW
extattr_delete_link Delete EAs on a path with NOFOLLOW

These calls are similar to __mac_get_file(), __mac_set_file(),
extattr_set_file(), extattr_get_file(), and extattr_delete_file(),
except that they do not follow symlinks. The distinction between
these calls is similar to lchown() vs chown().

Implementations to follow.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 103972 25-Sep-2002 archie

Make the following name changes to KSE related functions, etc., to better
represent their purpose and minimize namespace conflicts:

kse_fn_t -> kse_func_t
struct thread_mailbox -> struct kse_thr_mailbox
thread_interrupt() -> kse_thr_interrupt()
kse_yield() -> kse_release()
kse_new() -> kse_create()

Add missing declaration of kse_thr_interrupt() to <sys/kse.h>.
Regenerate the various generated syscall files. Minor style fixes.

Reviewed by: julian


# 103574 18-Sep-2002 alfred

Add the rest of the kernel support for the sem_ API in kern/uipc_sem.c.

Option 'P1003_1B_SEMAPHORES' to compile them in, or load the "sem" module
to activate them.

Have kern/makesyscalls.sh emit an include for sys/_semaphore.h into sysproto.h
to pull in the typedef for semid_t.

Add the syscalls to the syscall table as module stubs.


# 102132 19-Aug-2002 rwatson

mac_syscall is now implemented, switch to MSTD.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 101425 06-Aug-2002 rwatson

Rename mac_policy() to mac_syscall() to be more reflective of its
purpose.

Submitted by: cvance@tislabs.com
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100990 30-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Replace 'void *' with 'struct mac *' now that mac.h is in the base
tree. The current POSIX.1e-derived userland MAC interface is
schedule for replacement, but will act as a functional placeholder
until the replacement is done. These system calls allow userland
processes to get and set labels on both the current process, as well
as file system objects and file descriptor backed objects.


# 100954 30-Jul-2002 rwatson

Introduce a mac_policy() system call that will provide MAC policies
with a general purpose front end entry point for user applications
to invoke. The MAC framework will route the system call to the
appropriate policy by name.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100896 30-Jul-2002 rwatson

Prototype function arguments, only with MAC-specific structures
replaced with void until we bring in the actual structure definitions.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 99916 13-Jul-2002 alfred

Remove incorrect comment about now corrected manpage.


# 99855 12-Jul-2002 alfred

Create a bug-for-bug FreeBSD4 compatible version of sendfile and move the
fixed sendfile over. This is needed to preserve binary compatibility from
4.x to 5.x.


# 99072 29-Jun-2002 julian

Part 1 of KSE-III

The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 98197 13-Jun-2002 rwatson

Keep POSIX.1e capabilities system call placeholders, but remove definitions.


# 97369 28-May-2002 marcel

Add syscall uuidgen() for generating Univerally Unique Identifiers
(UUIDs). On ia64 UUIDs, aka GUIDs, are used by EFI and the firmware
among others. To create GUID Partition Tables (GPTs), we need to
be able to generate UUIDs.


# 96083 05-May-2002 mux

Add an entry for the lchflags(2) syscall. It's useful to prevent
a symlink deletion.

Reviewed by: rwatson


# 94935 17-Apr-2002 mux

Add an entry for the kenv(2) syscall (code to follow).

Reviewed by: peter


# 94640 14-Apr-2002 alc

Remove the requirement that Giant be held around sigreturn().


# 94446 11-Apr-2002 alc

Remove the requirement that Giant be held around osigreturn(). All platform-
specific implementations are MPSAFE.


# 91692 05-Mar-2002 rwatson

Reserve system call numbers for the MAC framework. This will prevent
people working on the MAC tree from getting toasted whenever system call
numbers are allocated in the main tree (for example, for KSE :-).
Calls allocated: __mac_{get,set}_proc, __mac_{get,set}_{fd,file}().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 90889 19-Feb-2002 julian

Add stub syscalls and definitions for KSE calls.
"Book'em Danno"


# 90886 19-Feb-2002 julian

Add 5 KSE syscalls. Two will be implemented with the next KSE
step and the others are reservations for coming code.
All will be stubbed in this kernel in the next commit.
This will allow people to easily make KSE binaries for userland testing
(the syscalls will be in libc) but they will still need a real KSE kernel
to test it. (libc looks in /sys to decide what it should add stubs for).


# 90777 17-Feb-2002 deischen

Fix prototype to sigreturn to use struct __ucontext instead of ucontext_t.


# 90448 10-Feb-2002 rwatson

Part I: Update extended attribute API and ABI:

o Modify the system call syntax for extattr_{get,set}_{fd,file}() so
as not to use the scatter gather API (which appeared not to be used
by any consumers, and be less portable), rather, accepts 'data'
and 'nbytes' in the style of other simple read/write interfaces.
This changes the API and ABI.

o Modify system call semantics so that extattr_get_{fd,file}() return
a size_t. When performing a read, the number of bytes read will
be returned, unless the data pointer is NULL, in which case the
number of bytes of data are returned. This changes the API only.

o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t
argument so as to return the size, if desirable. If set to NULL,
the size will not be returned.

o Update various filesystems (pseodofs, ufs) to DTRT.

These changes should make extended attributes more useful and more
portable. More commits to rebuild the system call files, as well
as update userland utilities to follow.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 90073 01-Feb-2002 bde

Made osigreturn(2) standard so that SYS_osigreturn can be used in the
signal trampoline for old signals. The arches that support old signals
currently abuse sigreturn(2) instead. This mainly complicates things
and slightly breaks the the new sigreturn(2).

COMPAT is too limited to support the correct configuration of osigreturn,
and this commit doesn't attempt to fix it; it just moves the bogusness:
osigreturn() must now be provided unconditionally even on arches that
don't really need it; previously it had to be provided under the bogus
condition defined(COMPAT_43).


# 88633 29-Dec-2001 alfred

Make AIO a loadable module.

Remove the explicit call to aio_proc_rundown() from exit1(), instead AIO
will use at_exit(9).

Add functions at_exec(9), rm_at_exec(9) which function nearly the
same as at_exec(9) and rm_at_exec(9), these functions are called
on behalf of modules at the time of execve(2) after the image
activator has run.

Use a modified version of tegge's suggestion via at_exec(9) to close
an exploitable race in AIO.

Fix SYSCALL_MODULE_HELPER such that it's archetecuterally neutral,
the problem was that one had to pass it a paramater indicating the
number of arguments which were actually the number of "int". Fix
it by using an inline version of the AS macro against the syscall
arguments. (AS should be available globally but we'll get to that
later.)

Add a primative system for dynamically adding kqueue ops, it's really
not as sophisticated as it should be, but I'll discuss with jlemon when
he's around.


# 85890 02-Nov-2001 phk

Reserve 378 for the new mount syscall Maxime Henrion <mux@qualys.com>
is working on. (This is to get us more than 32 mountoptions).


# 84883 13-Oct-2001 rwatson

o Reserve system call 377 for afs_syscall; by reserving a system call
number, portable OpenAFS applications don't have to attempt to determine
what system call number was dynamically allocated. No system call
prototype or implementation is defined.

Requested by: Tom Maher <tardis@watson.org>


# 83795 21-Sep-2001 rwatson

o Introduce eaccess(2), a version of access(2) that uses the effective
credentials rather than the real credentials. This is useful for
implementing GUI's which need to modify icons based on access rights,
but where use of open(2) is too expensive, use of stat(2) doesn't
reflect the file system's real protection model, and use of
access() suffers from real/effective credential confusion. This
implementation provides the same semantics as the call of the same
name on SCO OpenServer. Note: using this call improperly can
leave you subject to some of the same races present in the
access(2) call.
o To implement this, break out the basic logic of access(2) into
vpaccess(), which accepts a passed credential to perform the
invocation of VOP_ACCESS(). Add eaccess(2) to invoke vpaccess(),
and modify access(2) to use vpaccess().

Obtained from: TrustedBSD Project


# 83651 18-Sep-2001 peter

Cleanup and split of nfs client and server code.
This builds on the top of several repo-copies.


# 82753 01-Sep-2001 dillon

Synchronize syscalls.master(s) with recent Giant pushdown work


# 82711 01-Sep-2001 dillon

Make yield() MPSAFE.

Synchronize syscalls.master with all MPSAFE changes to date. Synchronize
new syscall generation follows because yield() will panic if it is out
of sync with syscalls.master.


# 82610 30-Aug-2001 dillon

Giant pushdown syscalls in kern/uipc_syscalls.c. Affected calls:

recvmsg(), sendmsg(), recvfrom(), accept(), getpeername(), getsockname(),
socket(), connect(), accept(), send(), recv(), bind(), setsockopt(), listen(),
sendto(), shutdown(), socketpair(), sendfile()


# 82607 30-Aug-2001 dillon

Giant Pushdown: sysv shm, sem, and msg calls.


# 82585 30-Aug-2001 dillon

Remove the MPSAFE keyword from the parser for syscalls.master.
Instead introduce the [M] prefix to existing keywords. e.g.
MSTD is the MP SAFE version of STD. This is prepatory for a
massive Giant lock pushdown. The old MPSAFE keyword made
syscalls.master too messy.

Begin comments MP-Safe procedures with the comment:
/*
* MPSAFE
*/
This comments means that the procedure may be called without
Giant held (The procedure itself may still need to obtain
Giant temporarily to do its thing).

sv_prepsyscall() is now MP SAFE and assumed to be MP SAFE
sv_transtrap() is now MP SAFE and assumed to be MP SAFE

ktrsyscall() and ktrsysret() are now MP SAFE (Giant Pushdown)
trapsignal() is now MP SAFE (Giant Pushdown)

Places which used to do the if (mtx_owned(&Giant)) mtx_unlock(&Giant)
test in syscall[2]() in */*/trap.c now do not. Instead they
explicitly unlock Giant if they previously obtained it, and then
assert that it is no longer held to catch broken system calls.

Rebuild syscall tables.


# 77386 29-May-2001 phk

Remove a comment which was past its shelf life.

PR: 18750
Submitted by: Tony Finch <dot@dotat.at>


# 76827 18-May-2001 alfred

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 76472 11-May-2001 tegge

gettimeofday() is MP safe on both -current and -stable.


# 75426 11-Apr-2001 rwatson

o Introduce a new system call, __setsugid(), which allows a process to
toggle the P_SUGID bit explicitly, rather than relying on it being
set implicitly by other protection and credential logic. This feature
is introduced to support inter-process authorization regression testing
by simplifying userland credential management allowing the easy
isolation and reproduction of authorization events with specific
security contexts. This feature is enabled only by "options REGRESSION"
and is not intended to be used by applications. While the feature is
not known to introduce security vulnerabilities, it does allow
processes to enter previously inaccessible parts of the credential
state machine, and is therefore disabled by default. It may not
constitute a risk, and therefore in the future pending further analysis
(and appropriate need) may become a published interface.

Obtained from: TrustedBSD Project


# 75038 31-Mar-2001 rwatson

o Introduce extattr_{delete,get,set}_fd() to allow extended attribute
operations on file descriptors, which complement the existing set of
calls, extattr_{delete,get,set}_file() which act on paths. In doing
so, restructure the system call implementation such that the two sets
of functions share most of the relevant code, rather than duplicating
it. This pushes the vnode locking into the shared code, but keeps
the copying in of some arguments in the system call code. Allowing
access via file descriptors reduces the opportunity for race
conditions when managing extended attributes.

Obtained from: TrustedBSD Project


# 74437 19-Mar-2001 rwatson

o Rename "namespace" argument to "attrnamespace" as namespace is a C++
reserved word.

Submitted by: jkh
Obtained from: TrustedBSD Project


# 74273 15-Mar-2001 rwatson

o Change the API and ABI of the Extended Attribute kernel interfaces to
introduce a new argument, "namespace", rather than relying on a first-
character namespace indicator. This is in line with more recent
thinking on EA interfaces on various mailing lists, including the
posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces
are defined by default, EXTATTR_NAMESPACE_SYSTEM and
EXTATTR_NAMESPACE_USER, where the primary distinction lies in the
access control model: user EAs are accessible based on the normal
MAC and DAC file/directory protections, and system attributes are
limited to kernel-originated or appropriately privileged userland
requests.

o These API changes occur at several levels: the namespace argument is
introduced in the extattr_{get,set}_file() system call interfaces,
at the vnode operation level in the vop_{get,set}extattr() interfaces,
and in the UFS extended attribute implementation. Changes are also
introduced in the VFS extattrctl() interface (system call, VFS,
and UFS implementation), where the arguments are modified to include
a namespace field, as well as modified to advoid direct access to
userspace variables from below the VFS layer (in the style of recent
changes to mount by adrian@FreeBSD.org). This required some cleanup
and bug fixing regarding VFS locks and the VFS interface, as a vnode
pointer may now be optionally submitted to the VFS_EXTATTRCTL()
call. Updated documentation for the VFS interface will be committed
shortly.

o In the near future, the auto-starting feature will be updated to
search two sub-directories to the ".attribute" directory in appropriate
file systems: "user" and "system" to locate attributes intended for
those namespaces, as the single filename is no longer sufficient
to indicate what namespace the attribute is intended for. Until this
is committed, all attributes auto-started by UFS will be placed in
the EXTATTR_NAMESPACE_SYSTEM namespace.

o The default POSIX.1e attribute names for ACLs and Capabilities have
been updated to no longer include the '$' in their filename. As such,
if you're using these features, you'll need to rename the attribute
backing files to the same names without '$' symbols in front.

o Note that these changes will require changes in userland, which will
be committed shortly. These include modifications to the extended
attribute utilities, as well as to libutil for new namespace
string conversion routines. Once the matching userland changes are
committed, a buildworld is recommended to update all the necessary
include files and verify that the kernel and userland environments
are in sync. Note: If you do not use extended attributes (most people
won't), upgrading is not imperative although since the system call
API has changed, the new userland extended attribute code will no longer
compile with old include files.

o Couple of minor cleanups while I'm there: make more code compilation
conditional on FFS_EXTATTR, which should recover a bit of space on
kernels running without EA's, as well as update copyright dates.

Obtained from: TrustedBSD Project


# 69512 02-Dec-2000 jake

Remove thr_sleep and thr_wakeup. Remove fields p_nthread and p_wakeup
from struct proc, which are now unused (p_nthread already was).
Remove process flag P_KTHREADP which was untested and only set
in vfs_aio.c (it should use kthread_create). Move the yield
system call to kern_synch.c as kern_threads.c has been removed
completely.

moral support from: alfred, jhb


# 69449 01-Dec-2000 alfred

sysvipc loadable.

new syscall entry lkmressys - "reserved loadable syscall"

Make syscall_register allow overwriting of such entries (lkmressys).


# 65150 28-Aug-2000 marcel

Fix prototypes for {o|}{g|s}etrlimit. A recent change in the
Linuxulator caused this bug to trigger.


# 64001 29-Jul-2000 peter

Sigh. Fix SYS_exit problems. I misunderstood the significance of these
trailing options.


# 63986 28-Jul-2000 peter

Change the 'exit()' system call to 'sys_exit()'. This avoids overlapping
gcc's internal exit() prototypes and the (futile) hackery that we did to
try and avoid warnings. main() was renamed for similar reasons.
Remove an exit related hack from makesyscalls.sh.


# 63452 18-Jul-2000 jlemon

Simplify kqueue API slightly.

Discussed on: -arch


# 63082 13-Jul-2000 rwatson

o Introduce syscall prototypes, stubs for __cap_{get,set}_{fd,file},
syscalls to manage capability sets on files. First of two commits.

Obtained from: TrustedBSD Project


# 61718 15-Jun-2000 rwatson

Introduce syscalls for process capability manipulation. Currently backs
onto already committed stubs. Commit one of two.

Reviewed by: Damned if I can remember. Many people.
Obtained from: TrustedBSD Project


# 60247 09-May-2000 bde

Fixed the declaration of mmap(). The crufty padding arg had the wrong
type. This gave an inconsistent amount of crufty padding on i386's with
64-bit longs (8 bytes instead of 4). On alphas it gives a consistent
amount of crufty padding (8 bytes) in addition to the 4 bytes of normal
padding caused by passing int args as register_t's.

Fixed the args struct tag for the NOPROTO syscalls (netbsd_lchown() and
netbsd_msync()). The tag is currently unused for NOPROTO syscalls, so
the bug has no effect, but it will be used even in the NOPROTO case to
calculate sy_nargs correctly.


# 59827 01-May-2000 peter

Remove undocumented broken-as-designed semconfig() syscall.


# 59288 16-Apr-2000 jlemon

Introduce kqueue() and kevent(), a kernel event notification facility.


# 58963 03-Apr-2000 alfred

Make makesyscalls.sh parse an optional field 'MPSAFE' that specifies
that a syscall does not want the BGL to be grabbed automatically.

Add the new MPSAFE flag to the syscalls that dillon has determined to
be MPSAFE.


# 56270 19-Jan-2000 rwatson

Fix bde'isms in acl/extattr syscall interface, renaming syscalls to
prettier (?) names, adding some const's around here, et al.

Commit 1 out of 3.

Reviewed by: bde


# 56115 16-Jan-2000 peter

Implement setres[ug]id() and getres[ug]id(). This has been sitting in
my tree for ages (~2 years) waiting for an excuse to commit it. Now Linux
has implemented it and it seems that Staroffice (when using the
linux_base6.1 port's libc) calls this in the linux emulator and dies in
setup. The Linux emulator can call these now.


# 55943 14-Jan-2000 jasone

Add aio_waitcomplete(). Make aio work correctly for socket descriptors.
Make gratuitous style(9) fixes (me, not the submitter) to make the aio
code more readable.

PR: kern/12053
Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>


# 54970 21-Dec-1999 alfred

make getfh a standard syscall instead of dependant on having
NFSSERVER defined, useful for userland fileservers that want to
use a filehandle type interface to the filesystem.

Submitted by: Assar Westerlund assar@stacken.kth.se
PR: kern/15452


# 54802 19-Dec-1999 rwatson

First pass commit to introduce new ACL and Extended Attribute system calls.
The second pass commit with all the supporting code will happen shortly
afterwards.

Reviewed by: eivind


# 53299 17-Nov-1999 brian

modfind(char *) -> modfind(const char *)

Reminded by: dfr


# 52149 12-Oct-1999 marcel

Now that userland including modules don't use the osig* syscalls,
make them of type COMPAT.


# 51790 29-Sep-1999 marcel

sigset_t change (part 1 of 5)
-----------------------------

Rename sigaction, sigprocmask, sigpending and sigsuspend to
osigaction, osigprocmask, osigpending and osigsuspend (resp)
and add new syscalls for them to support the new sisgset_t
without breaking existing binaries.

Change the prototype of sigaltstack to use the typedef stack_t
instead of struct sigaltstack to reflect that it is SUSv2
compliant.

Also, rename sigreturn to osigreturn and add a new syscall
to support the modified stackframe. The change is caused by
sigreturn operating on ucontext_t now and the fact that
siginfo_t has been updated to conform to SUSv2.


# 51138 10-Sep-1999 alfred

Seperate the export check in VFS_FHTOVP, exports are now checked via
VFS_CHECKEXP.

Add fh(open|stat|stafs) syscalls to allow userland to query filesystems
based on (network) filehandle.

Obtained from: NetBSD


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 49641 11-Aug-1999 nik

Add CPT_NOA, LIBCOMPAT, NODEF, NOARGS, NOPROTO, and NOIMPL to the commented
list of available types.

PR: docs/13007
Submitted by: Assar Westerlund <assar@sics.se>


# 49428 05-Aug-1999 jkh

Move syscall 180 back to where it was before and fix the
incorrect comment which led me to move it in the first place.


# 49420 04-Aug-1999 jkh

Reserve a syscall for the arla folks. I'm assuming that since syscalls.c
and init_sysent.c are checked into CVS, I should also commit the regenerated
copies even though they're built by syscalls.master. Correct? Bruce? :)


# 47103 13-May-1999 bde

Fixed nonsense arg type `const caddr_t' in the prototype() for utrace().
Changed to `const void *'. utrace() is undocumented, so nothing should
notice.

Fixed missing consts for utrace() and ktrace() in syscalls.master.

sys/ktrace.h is missing some Lite2 changes of shorts to ints.


# 46154 28-Apr-1999 phk

Add the jail system call.


# 45311 04-Apr-1999 dt

Add standard padding argument to pread and pwrite syscall. That should make them
NetBSD compatible.

Add parameter to fo_read and fo_write. (The only flag FOF_OFFSET mean that
the offset is set in the struct uio).

Factor out some common code from read/pread/write/pwrite syscalls.


# 45065 27-Mar-1999 alc

Added pread and pwrite. These functions are defined by the X/Open
Threads Extension. (Note: We use the same syscall numbers as NetBSD.)

Submitted by: John Plevyak <jplevyak@inktomi.com>


# 41088 11-Nov-1998 peter

A kldsym(2) syscall prototype for extracting information from the in-kernel
linker. This is intended to replace kvm_mkdb etc. The first version
only does name->value lookups, but it's open ended. value->name lookups
would probably be a good thing to do too.

It's been suggested to try and connect the symbol tables to sysctl (which
is probably a more flexible way of doing it if it's done right), but that
is far more complex and difficult than I was ready to have a shot at.


# 40931 05-Nov-1998 dg

Implemented zero-copy TCP/IP extensions via sendfile(2) - send a
file to a stream socket. sendfile(2) is similar to implementations in
HP-UX, Linux, and other systems, but the API is more extensive and
addresses many of the complaints that the Apache Group and others have
had with those other implementations. Thanks to Marc Slemko of the
Apache Group for helping me work out the best API for this.
Anyway, this has the "net" result of speeding up sends of files over
TCP/IP sockets by about 10X (that is to say, uses 1/10th of the CPU
cycles) when compared to a traditional read/write loop.


# 38515 24-Aug-1998 dfr

Fix a few syscall arguments to use size_t instead of u_int.


# 36735 07-Jun-1998 dfr

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# 36033 14-May-1998 peter

deep-six signanosleep(). It sounded like a good idea at the time.


# 35938 11-May-1998 dyson

Fix the futimes/undelete/utrace conflict with other BSD's. Note that
the only common usage of utrace (the possible problem with this
commit) is with malloc, so this should be a real problem. Add
the various NetBSD syscalls that allow full emulation of their
development environment.


# 34925 28-Mar-1998 dufault

Finish _POSIX_PRIORITY_SCHEDULING. Needs P1003_1B and
_KPOSIX_PRIORITY_SCHEDULING options to work. Changes:

Change all "posix4" to "p1003_1b". Misnamed files are left
as "posix4" until I'm told if I can simply delete them and add
new ones;

Add _POSIX_PRIORITY_SCHEDULING system calls for FreeBSD and Linux;

Add man pages for _POSIX_PRIORITY_SCHEDULING system calls;

Add options to LINT;

Minor fixes to P1003_1B code during testing.


# 33040 03-Feb-1998 bde

Fixed type of mincore().


# 32889 30-Jan-1998 phk

Retire LFS.

If you want to play with it, you can find the final version of the
code in the repository the tag LFS_RETIREMENT.

If somebody makes LFS work again, adding it back is certainly
desireable, but as it is now nobody seems to care much about it,
and it has suffered considerable bitrot since its somewhat haphazard
integration.

R.I.P


# 32726 24-Jan-1998 eivind

Make all file-system (MFS, FFS, NFS, LFS, DEVFS) related option new-style.

This introduce an xxxFS_BOOT for each of the rootable filesystems.
(Presently not required, but encouraged to allow a smooth move of option *FS
to opt_dontuse.h later.)

LFS is temporarily disabled, and will be re-enabled tomorrow.


# 32166 01-Jan-1998 alex

Added missing caddr_t --> void * conversions for sys/mman.h functions.

Submitted by: bde


# 30740 26-Oct-1997 phk

Add "NOIMPL" for syscalls we know what is, but don't implement as "STD".
Use this for getfh & nfssvc.


# 29391 14-Sep-1997 phk

Add a __getcwd() syscall. This is intentionally undocumented, but all
it does is to try to figure the pwd out from the vfs namecache, and
return a reversed string to it. libc:getcwd() is responsible for
flipping it back.


# 29348 14-Sep-1997 peter

Activate poll(2) syscall


# 28399 19-Aug-1997 peter

SVR4/XPG-style getpgid()/getsid() syscalls.


# 26671 15-Jun-1997 dyson

Modifications to existing files to support the initial AIO/LIO and
kernel based threading support.


# 26333 01-Jun-1997 peter

New syscall, signanosleep(), which is a hybrid of sigsuspend(2) and
nanosleep(2). It sleeps until either the time expires, or a signal
permitted by the supplied mask arrives (eg: SIGALRM if appropriate)


# 25581 08-May-1997 peter

oops. NODIDE -> NOHIDE


# 25580 08-May-1997 peter

Define entries for the posix-style clock/timer syscalls including
nanosleep(). Also, note some syscall conflicts with other systems and
indicate slots tagged for use with other syscalls some day.


# 25537 07-May-1997 dfr

This is the kernel linker. To use it, you will first need to apply
the patches in freefall:/home/dfr/ld.diffs to your ld sources and set
BINFORMAT to aoutkld when linking the kernel.

Library changes and userland utilities will appear in a later commit.


# 24451 31-Mar-1997 peter

issetugid is now implemented rather than reserved


# 24439 31-Mar-1997 peter

Reserve 252 (poll, first in OpenBSD)
Reserve 253 (issetugid, as in OpenBSD)
Allocate 254 for lchown(2)


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 21776 16-Jan-1997 bde

Reduced #include spam in <sys/sysproto.h> and fixed things that depended
on it.

makesyscalls.sh:
This parsed $Id$. Fixed(?) to parse $FreeBSD$. The output is wrong when
the id is not expanded in the source file.

syscalls.master:
Fixed declaration of sigsuspend(). There are still some bogons and
spam involving sigset_t.
Use `struct foo *' instead of the equivalent `foo_t *' for some nfs and
lfs syscalls so that <sys/sysproto.h> doesn't depend on <sys/mount.h>.


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 18398 19-Sep-1996 phk

Add the utrace(caddr_t addr,size_t len) syscall, that will store the
data pointed at in a ktrace file, if this process is being ktrace'ed.
I'm using this to profile malloc usage.
The advantage is that there is no context around this call, ie, no
open file or socket, so it will work in any process, and you can
decide if you want it to collect data or not.


# 17702 20-Aug-1996 smpatel

Remove the kernel FD_SETSIZE limit for select().
Make select()'s first argument 'int' not 'u_int'.

Reviewed by: bde


# 14322 02-Mar-1996 peter

Change the 'int len' args in the mmap/msync/mincore/etc class syscalls
to 'size_t' as per bde's request.


# 14219 23-Feb-1996 peter

Add hooks for rfork/minherit pair, and reset args of vfork in preperation
for adding the syscalls.


# 14215 23-Feb-1996 peter

Note the syscall numbers used in BSD/OS 2.x. We dont want to
accidently use one of these ourselves as it'd make it harder to run
their binaries.
Also, remove the now-defunct #include "opt_sysvipc.h".


# 13416 13-Jan-1996 phk

Add an option NFS_NOSERVER which saves 100K in the install kernel (or
any other kernel that uses it). Use with option NFS.


# 13331 08-Jan-1996 peter

Remove the #ifdef SYSVSHM etc. Always call the functions, some stubs
are about to go in. This is to fix the problem with the ibcs2 and linux
lkm's not being able to call the sysv ipc functions unless the build is
modified.


# 13226 04-Jan-1996 wollman

Convert SYSV IPC to new-style options. (I hope I got everything...)
The LKMs will need an extra file, to come later.


# 13203 03-Jan-1996 wollman

Converted two options over to the new scheme: USER_LDT and KTRACE.


# 12864 15-Dec-1995 peter

Add the direct sysv shm/sem/msg system calls, in the same way as NetBSD.
This costs very little, we gain prototypes for the calls from the linux
emulator, and this is one less thing in the way of NetBSD binary support.


# 12216 12-Nov-1995 bde

Fixed the args list for mount(). We're not ready for the BSD4.4lite2/
NetBSD interface.

Increased the bogusness of the args list for mmap(). The args lists for
most of the memory mapping functions are bogus. The args lists in
syscalls.master are a little better than the ones in the args structs
currently being used, but the improvement for mmap() changed the object
code and I don't want to worry about that now.

Increased the bogusness of the args list for fcntl. BSD4.4lite2/NetBSD
uses `void *' instead of int for the third arg. This has the advantage
of working when `void *'s are longer than ints, but requires extra bogus
casts that I hope to avoid.

Fixed the args list for uname. `struct outsname' seems to be a typo,
not an old interface.

Added comments about bogus args lists for open, mount, msync, munmap,
mprotect, madvise, mincore, fcntl, semsys, msgsys and shmsys.


# 11330 07-Oct-1995 swallace

Fix misc formatting errors in makesyscalls.sh.

Add CPT_NOA type which is COMPAT with NOARGS -- do not produce argument
struct in sysproto.

Change accept, recvfrom, getsockname to CPT_NOA type.
Fix getrlimit, setrlimit argument #2 name to struct rlimit.


# 11294 07-Oct-1995 swallace

Add new functionality to makesyscalls.sh:
o optional config-file to set vars: sysnames, sysproto, sysproto_h,
syshdr, syssw, syshide, syscallprefix, switchname, namesname, sysvec.
o change syntax of syscalls.master entry:
remove argument count.
add pseudo-prototype field defining function name and arguments.
o generates correct structure definitions for all system calls
in sys/sysproto.h
o add type NOARGS: same as STD except do not create structure in
sys/sysproto.h
o add type NOPROTO: same as STD except do not create structure or function
prototype in sys/sysproto.h

New functionality provides complete prototype definitions.
Usefull for generating files for emulated systems like my new ibcs2 code.

Update syscalls.master to reflect new changes. For example, read()
entry now looks like:

3 STD POSIX { int ibcs2_read(int fd, char *buf, u_int nbytes); }

This is similar to how NetBSD generates these files.


# 10905 19-Sep-1995 bde

Generate prototypes for syscall-implementing functions. Put them in
<sys/sysproto.h> and use them (so far only) in kern/init_sysent.c.

Don't put $Id in generated files.

kern/syscalls.master:
I had to add some new fields to describe some non-orthogonal names.
E.g., the args struct for the syscall-implementing function foo()
is usually named `foo_args', but for getpid() it is named `args'.

sys/sysent.h:
sy_call_t is still incomplete to hide a couple of warnings.


# 8019 23-Apr-1995 ache

Make setreuid/setregid active syscalls


# 7359 25-Mar-1995 dg

Added a third "flags" argument to msync() ...as other systems have.


# 6875 04-Mar-1995 dg

Removed obsolete vtrace() remnants.


# 5107 14-Dec-1994 wollman

Actually enable NTP kernel PLL. (Oops!)
Noticed by Pete Carah.


# 3291 02-Oct-1994 dg

"idle priority" support. Based on code from Henrik Vestergaard Draboel,
but substantially rewritten by me.


# 3178 28-Sep-1994 wollman

LKM support is no longer optional.


# 2858 18-Sep-1994 wollman

Redo Kernel NTP PLL support, kernel side.

This code is mostly taken from the 1.1 port (which was in turn taken from
Dave Mills's kern.tar.Z example). A few significant differences:

1) ntp_gettime() is now a MIB variable rather than a system call. A few
fiddles are done in libc to make it behave the same.

2) mono_time does not participate in the PLL adjustments.

3) A new interface has been defined (in <machine/clock.h>) for doing
possibly machine-dependent things around the time of the clock update.
This is used in Pentium kernels to disable interrupts, set `time', and
reset the CPU cycle counter as quickly as possible to avoid jitter in
microtime(). Measurements show an apparent resolution of a bit more than
8.14usec, which is reasonable given system-call overhead.


# 2729 13-Sep-1994 dfr

Added SYSV ipcs.

Obtained from: NetBSD and FreeBSD-1.1.5


# 2696 12-Sep-1994 wollman

Added namespace information for future pollution-control measures.


# 2441 01-Sep-1994 dg

Realtime priority scheduling support.

Submitted by: Henrik Vestergaard Draboel


# 2297 26-Aug-1994 wollman

Added ntp_gettime and ntp_adjtime syscalls, both nosys'ed out until
someone gets to re-integrating the code. ntp_gettime() should be
turned into a sysctl variable and emulated in the library.


# 2124 19-Aug-1994 dg

Terry Lambert's loadable kernel module support w/improvements from the
NetBSD group.


# 1817 02-Aug-1994 dg

Added $Id$


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources