History log of /freebsd-current/sys/compat/freebsd32/freebsd32_misc.c
Revision Date Author Comments
# d0efabdf 19-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

syscalls.master: make __sys_fcntl take an intptr_t

The (optional) third argument of fcntl is sometimes a pointer so change
the type to intptr_t. Update the libc-internal definition (actually used
by libthr) to take a fixed intptr_t argument rather than pretending it's
a variadic function. (That worked because all supported architectures
pass variadic arguments as though the function was declared with those
types. In CheriBSD that changes because variadic arguments are passed
via a bounded array.)

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D44381
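
The difference between a variadic and a fixed-prototype entry point can be sketched in portable C. The names `fixed_fcntl_model` and `variadic_fcntl_model` are hypothetical stand-ins, not the libc-internal symbols; the sketch only assumes that a variadic front end extracts the optional argument as an intptr_t and forwards it to a function with a fixed third parameter.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>

/*
 * Hypothetical fixed-prototype entry point: the third argument is always
 * an intptr_t, wide enough to carry either an int or a pointer.
 */
intptr_t
fixed_fcntl_model(int fd, int cmd, intptr_t arg)
{
	assert(fd >= 0);
	(void)cmd;
	return (arg);		/* Echo the argument so callers can inspect it. */
}

/* Hypothetical variadic front end, shaped like a public fcntl()-style API. */
intptr_t
variadic_fcntl_model(int fd, int cmd, ...)
{
	va_list ap;
	intptr_t arg;

	va_start(ap, cmd);
	arg = va_arg(ap, intptr_t);	/* Caller must pass an intptr_t here. */
	va_end(ap);
	return (fixed_fcntl_model(fd, cmd, arg));
}
```

With the fixed prototype, the argument type is part of the declaration, so the call does not depend on how a given ABI passes variadic arguments.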


# d060b420 18-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

freebsd32: struct siginfo32 -> struct __siginfo32

In the next commit I will update syscalls.master to use struct __siginfo
(which actually exists) so this update will be needed to make
generated files (from make sysent) align.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D44380


# 694ef157 19-Mar-2024 Brooks Davis <brooks@FreeBSD.org>

freebsd32: freebsd32_copyinuio takes const iovp

We only read the iovp so make it const like in copyinuio.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D44376


# 61cc4830 18-Jan-2024 Alfredo Mazzinghi <am2419@cl.cam.ac.uk>

Abstract UIO allocation and deallocation.

Introduce the allocuio() and freeuio() functions to allocate and
deallocate struct uio. This hides the actual allocator interface, so it
is easier to modify the sub-allocation layout of struct uio and the
corresponding iovec array.

Obtained from: CheriBSD
Reviewed by: kib, markj
MFC after: 2 weeks
Sponsored by: CHaOS, EPSRC grant EP/V000292/1
Differential Revision: https://reviews.freebsd.org/D43711
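
A userspace sketch of why a paired allocate/free interface helps: callers never see how the header and the iovec array are laid out. Everything here (`struct uio_model`, `allocuio_model`, `freeuio_model`) is a hypothetical model, and the single-allocation layout is only one possible choice, not necessarily the kernel's.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/uio.h>	/* struct iovec */

/* Simplified stand-in for the kernel's struct uio (userspace model). */
struct uio_model {
	struct iovec	*uio_iov;	/* Points at the trailing array. */
	int		 uio_iovcnt;
};

struct uio_model *
allocuio_model(int iovcnt)
{
	struct uio_model *uio;

	assert(iovcnt >= 0);
	/* One allocation holds the header followed by the iovec array. */
	uio = calloc(1, sizeof(*uio) + (size_t)iovcnt * sizeof(struct iovec));
	if (uio == NULL)
		return (NULL);
	uio->uio_iov = (struct iovec *)(uio + 1);
	uio->uio_iovcnt = iovcnt;
	return (uio);
}

void
freeuio_model(struct uio_model *uio)
{
	/* Callers never free the iovec array separately. */
	free(uio);
}
```

Because only these two functions know the layout, it could later change (for example to two separate allocations) without touching any caller.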


# d0adc2f2 25-Dec-2023 Mark Johnston <markj@FreeBSD.org>

sendfile: Explicitly ignore errors from copyout()

There is a documented bug in sendfile.2 which notes that sendfile(2)
does not raise an error if it fails to copy out the number of bytes
written. Explicitly ignore the error from copyout() calls in
preparation for annotating copyout() with __result_use_check.

Reviewed by: glebius, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D43129


# 81eb7baa 25-Dec-2023 Mark Johnston <markj@FreeBSD.org>

freebsd32: Report errors when copying out oldlenp in __sysctl

This matches the native implementation's behaviour.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D43101


# bd1654ce 21-Dec-2023 Mark Johnston <markj@FreeBSD.org>

freebsd32: Fix error handling for suword32() calls

suword32() returns -1 upon an error, not an errno value.

MFC after: 1 week
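
The class of bug can be illustrated in userspace. `suword32_model` is a hypothetical stand-in that mimics only the return convention described above (0 on success, -1 on failure); `store_pair32` is likewise hypothetical and shows the caller translating that -1 into an errno value instead of returning it directly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for suword32(): 0 on success, -1 on a bad address. */
int
suword32_model(uint32_t *uaddr, uint32_t val)
{
	if (uaddr == NULL)
		return (-1);
	*uaddr = val;
	return (0);
}

/* Store two words, converting the -1 failure indication to EFAULT. */
int
store_pair32(uint32_t *uaddr, const uint32_t vals[2])
{
	assert(vals != NULL);	/* The source buffer must be valid. */
	if (suword32_model(uaddr, vals[0]) != 0)
		return (EFAULT);
	if (suword32_model(uaddr + 1, vals[1]) != 0)
		return (EFAULT);
	return (0);
}
```

Returning the raw -1 instead would hand the syscall layer a value that is not a valid errno.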


# bddc7a8a 18-Nov-2023 Konstantin Belousov <kib@FreeBSD.org>

Tweak compat_freebsd_32bit feature name

Mark the current name 'compat_freebsd_32bit' as legacy, and add the
new name 'compat_freebsd32'. This seems to help with some make and
shell uses.

Requested by: jrtc27
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D42641


# 5a2bbace 16-Nov-2023 Konstantin Belousov <kib@FreeBSD.org>

FEATURE compat_freebsd_32bit: only report on arm64 when support is present

depending on hardware support for aarch32.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D42641


# 918966a2 05-Sep-2023 Jake Freeland <jfree@FreeBSD.org>

timerfd: Relocate 32-bit compat code

32-bit compatibility code is conventionally stored in
sys/compat/freebsd32. Move freebsd32_timerfd_gettime() and
freebsd32_timerfd_settime() from sys/kern/sys_timerfd.c to
sys/compat/freebsd32/freebsd32_misc.c.

MFC After: 3 days
Reviewed by: imp, markj
Differential Revision: https://reviews.freebsd.org/D41640


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 9b65fa69 29-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

linuxolator: implement Linux' PROT_GROWSDOWN

From the Linux man page for mprotect(2):
PROT_GROWSDOWN
Apply the protection mode down to the beginning of a mapping
that grows downward (which should be a stack segment or a
segment mapped with the MAP_GROWSDOWN flag set).

Reported by: dchagin
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41099


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# cb858340 28-Apr-2023 Dmitry Chagin <dchagin@FreeBSD.org>

linux(4): Add a dedicated statat() implementation

Get rid of calling the Linux stat translation hook and the Linux-specific
handling of non-vnode dirfd from kern_statat().

Reviewed by: kib, mjg
Differential revision: https://reviews.freebsd.org/D35474


# 140ceb5d 30-Nov-2022 Konstantin Belousov <kib@FreeBSD.org>

ptrace(2): add PT_SC_REMOTE remote syscall request

Reviewed by: markj
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37590


# f081a291 30-Nov-2022 Konstantin Belousov <kib@FreeBSD.org>

compat32: move struct ptrace_sc_ret32 definition from .c to .h

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37590


# eafafebe 23-Nov-2022 Ed Maste <emaste@FreeBSD.org>

compat32: retire now-unused MIPS support

This reverts commit a6d20bbaa2f4bb3684d2c396ef1f1411c8fb8499.


# 7b673a2c 15-Sep-2022 Jessica Clarke <jrtc27@FreeBSD.org>

freebsd32: Make sendmsg match native ABI for unpadded final control message

The API says that CMSG_SPACE should be used for msg_controllen, but in
practice the native ABI allows you to only use CMSG_LEN for the final
(typically only) control message, and real-world software does this,
including Wayland. For freebsd32, this is in practice mostly harmless,
since control messages are generally used to carry file descriptors,
which are already 4 bytes in size and thus no padding is needed, but
they can carry other quantities that may not result in an aligned
length. This was discovered after CheriBSD's freebsd64 equivalent was
updated to match the freebsd32 implementation, as that uses 8 byte
alignment which does break the file descriptor use case, and thus
Wayland.

This used to be addressed by aligning buflen before the first iteration,
but that allowed unwanted invalid inputs and was lost in 1b1428dcc82b,
with no safer equivalent put in its place.

Reviewed by: brooks, kib, markj
Obtained from: CheriBSD
Fixes: 1b1428dcc82b ("Fix a TOCTOU vulnerability in freebsd32_copyin_control().")
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D36554
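
The arithmetic behind the CMSG_LEN versus CMSG_SPACE difference can be modelled with explicit alignment parameters. The helpers below are hypothetical and do not use the real macros; the 12-byte header size is an assumption (a 4-byte length plus two ints), chosen to show why a 4-byte file descriptor needs no trailing padding under 4-byte alignment but does under 8-byte alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Round len up to a multiple of align (align must be a power of two). */
size_t
cmsg_align_model(size_t len, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((len + align - 1) & ~(align - 1));
}

/* Unpadded length: aligned header plus payload, in the style of CMSG_LEN. */
size_t
cmsg_len_model(size_t hdr, size_t payload, size_t align)
{
	return (cmsg_align_model(hdr, align) + payload);
}

/* Padded length: header and payload both aligned, in the style of CMSG_SPACE. */
size_t
cmsg_space_model(size_t hdr, size_t payload, size_t align)
{
	return (cmsg_align_model(hdr, align) +
	    cmsg_align_model(payload, align));
}
```

Under these assumptions, a single file descriptor gives equal "len" and "space" values with 4-byte alignment, while with 8-byte alignment the two differ by 4 bytes, which is the gap a caller leaves when it sizes the final message with CMSG_LEN.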


# c46697b9 24-Aug-2022 Brooks Davis <brooks@FreeBSD.org>

freebsd32_sendmsg: fix control message ABI

When a freebsd32 caller uses all or most allowed space for control
messages (MCLBYTES == 2K) then the message may no longer fit when
the messages are padded for 64-bit alignment. Historically we've just
shrugged and said there is no ABI guarantee. We ran into this on
CheriBSD where a capsicumized 64-bit nm would fail when called with more
than 64 files.

Fix this by not gratuitously capping the size of mbuf data we'll allocate
to MCLBYTES and let m_get2 allocate up to MJUMPAGESIZE (4K or larger).
Instead of hard-coding a length check, let m_get2 do it and check for a
NULL return.

Reviewed by: markj, jhb, emaste
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D36322


# 361971fb 02-Jun-2022 Kornel Dulęba <kd@FreeBSD.org>

Rework how shared page related data is stored

Store the shared page address in struct vmspace.
Also instead of storing absolute addresses of various shared page
segments save their offsets with respect to the shared page address.
This will be more useful when the shared page address is randomized.

Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35393


# d46174cd 28-May-2022 Dmitry Chagin <dchagin@FreeBSD.org>

Finish cpuset_getaffinity() after f35093f8

Split cpuset_getaffinity() into two counterparts, where the
user_cpuset_getaffinity() is intended to operate on the cpuset_t from
user va, while kern_cpuset_getaffinity() expects the cpuset from kernel
va.
Accordingly, the code that clears the high bits is moved to the
user_cpuset_getaffinity(). Linux sched_getaffinity() syscall returns
the size of set copied to the user-space and then glibc wrapper clears
the high bits.

MFC after: 2 weeks


# 4a3e5133 20-May-2022 Mark Johnston <markj@FreeBSD.org>

cpuset: Fix the KASAN and KMSAN builds

Rename the "copyin" and "copyout" fields of struct cpuset_copy_cb to
something less generic, since sanitizers define interceptors for
copyin() and copyout() using #define.

Reported by: syzbot+2db5d644097fc698fb6f@syzkaller.appspotmail.com
Fixes: 47a57144af25 ("cpuset: Byte swap cpuset for compat32 on big endian architectures")
Sponsored by: The FreeBSD Foundation


# 47a57144 12-May-2022 Justin Hibbits <jhibbits@FreeBSD.org>

cpuset: Byte swap cpuset for compat32 on big endian architectures

Summary:
BITSET uses long as its basic underlying type, which is dependent on the
compile type, meaning on 32-bit builds the basic type is 32 bits, but on
64-bit builds it's 64 bits. On little endian architectures this doesn't
matter, because the LSB is always at the low bit, so the words get
effectively concatenated moving between 32-bit and 64-bit, but on
big-endian architectures it throws a wrench in, as setting bit 0 in
32-bit mode is equivalent to setting bit 32 in 64-bit mode. To
demonstrate:

32-bit mode:

BIT_SET(foo, 0): 0x00000001

64-bit sees: 0x0000000100000000

cpuset is the only system interface that uses bitsets, so solve this
by swapping the integer sub-components at the copyin/copyout points.

Reviewed by: kib
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D35225
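
The swap of integer sub-components can be sketched as exchanging the two 32-bit halves of each 64-bit word. The function names are hypothetical and this is not the kernel's code; it only reproduces the example above, where a 32-bit BIT_SET(foo, 0) is seen by a 64-bit big-endian kernel as 0x0000000100000000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Exchange the two 32-bit halves of one 64-bit bitset word. */
uint64_t
swap_halves64(uint64_t w)
{
	return ((w << 32) | (w >> 32));
}

/* Apply the swap to every word of a bitset, in place. */
void
bitset_swap_halves(uint64_t *words, size_t nwords)
{
	size_t i;

	assert(words != NULL || nwords == 0);
	for (i = 0; i < nwords; i++)
		words[i] = swap_halves64(words[i]);
}
```

The operation is its own inverse, so the same helper would serve both the copyin and the copyout direction.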


# f35093f8 11-May-2022 Dmitry Chagin <dchagin@FreeBSD.org>

Use Linux semantics for the thread affinity syscalls.

Linux has more tolerant checks of the user supplied cpuset_t's.

Minimum cpuset_t size that the Linux kernel permits in case of
getaffinity() is the maximum CPU id, present in the system / NBBY,
the maximum size is not limited.
For setaffinity(), Linux does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where
the upper bound is the maximum CPU id, present in the system, no larger
than the size of the kernel cpuset_t.
Unlike FreeBSD, Linux ignores high bits if set in the setaffinity(),
so clear it in the sched_setaffinity() and Linuxulator itself.

Reviewed by: Pau Amma (man pages)
In collaboration with: jhb
Differential revision: https://reviews.freebsd.org/D34849
MFC after: 2 weeks


# 8299f9a5 30-Mar-2022 Ed Maste <emaste@FreeBSD.org>

compat32: add size CTASSERTs for non-amd64 cases

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34712


# f90cd1ae 29-Mar-2022 Ed Maste <emaste@FreeBSD.org>

Clear non-x86 compat stat syscall kernel stack memory disclosure

32-bit architectures other than i386 have 64-bit time_t which results
in a struct timespec with 12 bytes for tv_sec and tv_nsec, and 4 bytes
of padding. Zero the padding holes in struct stat32 and struct
freebsd11_stat32.

i386 has 32-bit time_t; struct timespec is 8 bytes and has no padding.

Found by inspection, prompted by a report by Reno Robert of Trend Micro
Zero Day Initiative. The originally reported issue (ZDI-CAN-14538) is
already fixed in all supported FreeBSD versions (it was addressed
incidentally as part of the 64-bit inode project).

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34709
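
A minimal model of the padding hole and the usual remedy, zeroing the whole structure before filling in the fields. `struct timespec_model` and `timespec_model_fill` are hypothetical; whether the struct really has trailing padding depends on the alignment of int64_t in the ABI being compiled for.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of a 32-bit-ABI timespec with a 64-bit time_t. */
struct timespec_model {
	int64_t	tv_sec;
	int32_t	tv_nsec;
	/* Where int64_t is 8-byte aligned, 4 bytes of padding follow. */
};

/* Fill *dst without exposing stale bytes in any padding hole. */
void
timespec_model_fill(struct timespec_model *dst, int64_t sec, int32_t nsec)
{
	assert(dst != NULL);
	memset(dst, 0, sizeof(*dst));	/* Zero padding along with the fields. */
	dst->tv_sec = sec;
	dst->tv_nsec = nsec;
}
```

Without the memset, a structure built on the kernel stack and copied out whole would carry whatever bytes previously occupied the padding.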


# 949e3959 07-Feb-2022 John Baldwin <jhb@FreeBSD.org>

Trim duplicate code for copying in iovecs for PT_[GS]ETREGSET.

Reviewed by: andrew, emaste
Differential Revision: https://reviews.freebsd.org/D34177


# 548a2ec4 24-Jan-2022 Andrew Turner <andrew@FreeBSD.org>

Add PT_GETREGSET

This adds the PT_GETREGSET and PT_SETREGSET ptrace types. These can be
used to access all the registers from a specified core dump note type.
The NT_PRSTATUS and NT_FPREGSET notes are initially supported. Other
machine-dependent types are expected to be added in the future.

The ptrace addr points to a struct iovec pointing at memory to hold the
registers along with its length. On success the length in the iovec is
updated to tell userspace the actual length the kernel wrote or, if the
base address is NULL, the length the kernel would have written.

Because the data field is an int the arguments are backwards when
compared to the Linux PTRACE_GETREGSET call.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19831
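
The iovec length convention described above can be modelled in userspace. `regset_get_model` is a hypothetical function, not the ptrace implementation, and its handling of a too-small buffer (returning EINVAL) is an assumption made for the sketch; only the NULL-base size query and the length update come from the description.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/uio.h>	/* struct iovec */

/* Hypothetical model: copy a register set into the caller's iovec. */
int
regset_get_model(struct iovec *iov, const void *regs, size_t regsize)
{
	assert(iov != NULL && regs != NULL);
	if (iov->iov_base == NULL) {
		iov->iov_len = regsize;	/* Size query: nothing is written. */
		return (0);
	}
	if (iov->iov_len < regsize)
		return (EINVAL);	/* Assumption: short buffers are rejected. */
	memcpy(iov->iov_base, regs, regsize);
	iov->iov_len = regsize;		/* Report the length actually written. */
	return (0);
}
```

A caller would typically issue one call with a NULL base to learn the size, allocate a buffer, and then call again.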


# fe6db727 21-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

Add security.bsd.allow_ptrace sysctl

that disables any access to ptrace(2) for all processes.

Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33986


# 758d98de 17-Jan-2022 Mark Johnston <markj@FreeBSD.org>

exec: Remove the stack gap implementation

ASLR stack randomization will reappear in a forthcoming commit. Rather
than inserting a random gap into the stack mapping, the entire stack
mapping itself will be randomized in the same way that other mappings
are when ASLR is enabled.

No functional change intended, as the stack gap implementation is
currently disabled by default.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704


# 706f4a81 17-Jan-2022 Mark Johnston <markj@FreeBSD.org>

exec: Introduce the PROC_PS_STRINGS() macro

Rather than fetching the ps_strings address directly from a process'
sysentvec, use this macro. With stack address randomization the
ps_strings address is no longer fixed.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704


# f04a0960 30-Dec-2021 Mark Johnston <markj@FreeBSD.org>

exec: Simplify sv_copyout_strings implementations a bit

Simplify control flow around handling of the execpath length and signal
trampoline. Cache the sysentvec pointer in a local variable.

No functional change intended.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33703


# cc5aa0a4 28-Dec-2021 John Baldwin <jhb@FreeBSD.org>

sys/compat: Use C99 fixed-width integer types.

No functional change.

Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D33632


# 794d3e8e 05-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

fcntl(2): add F_KINFO operation

that returns struct kinfo_file for the given file descriptor. Among
other data, it also returns kf_path, if file op was able to restore file
path.

Reviewed by: jhb, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33277


# 1caaf555 24-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

32-bit compat: plug a set-but-not-used var in freebsd32_copy_msg_out

Sponsored by: Rubicon Communications, LLC ("Netgate")


# 6eefabd4 22-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: improve nstat, nfstat, nlstat

Optionally return errors when truncating dev_t, ino_t, and nlink_t.
In the interest of code reuse, use freebsd11_cvtstat() to perform the
truncation and error handling and then convert the resulting struct
freebsd11_stat to struct nstat.

Add missing freebsd32 compat syscalls. These syscalls require
translation because struct nstat contains four instances of struct
timespec which in turn contains a time_t and a long.

Reviewed by: kib


# fea4a9af 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

fspacectl: remove unneeded freebsd32 wrapper

fspacectl(2) does not require special handling on freebsd32. The
presence of off_t in a struct does not cause its size to change
between the native ABI and the 32-bit ABI supported by freebsd32
because off_t is always int64_t on BSD systems. Further, byte
order only requires handling for paired argument or return registers.

(32-bit alignment of 64-bit objects on i386 can require special
handling, but that situation does not apply here.)

Reviewed by: kib, khng, emaste, delphij
Differential Revision: https://reviews.freebsd.org/D32994


# 158dcd73 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: have sigqueue take a void *

This matches the default ABI and we work around issues with
union sigval by extracting the bottom 32-bits in a manual handler.

Reviewed by: kevans


# 2b9d052d 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: fix getfsstat sign extension bugs

Add freebsd32 versions of getfsstat and freebsd11_getfsstat so that
bufsize is properly sign-extended if a negative value is passed.
Reject negative values before passing to kern_getfsstat as a size_t.

Reviewed by: kevans
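
The sign-extension concern can be sketched as follows. `compat_bufsize_model` is a hypothetical helper, and EINVAL is an assumed error code; the point is only that a 32-bit value is widened as signed and checked before it is ever treated as a size_t.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: widen a 32-bit signed size argument and reject
 * negative values before they can be reinterpreted as a huge size_t.
 */
int
compat_bufsize_model(uint32_t raw, size_t *sizep)
{
	int64_t bufsize;

	assert(sizep != NULL);
	bufsize = (int32_t)raw;		/* Sign-extend, do not zero-extend. */
	if (bufsize < 0)
		return (EINVAL);	/* Assumed error code for the sketch. */
	*sizep = (size_t)bufsize;
	return (0);
}
```

If the raw register value were zero-extended instead, a 32-bit -1 would arrive as 4294967295 and pass a naive size check.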


# f19e3fd2 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: signed long corrections

Syscalls that take signed longs need to treat the 32-bit versions as
signed int so that sign extension happens correctly. Improve
decleration quality and add a few minimal syscall implementations.

Reviewed by: kevans


# f089a2f3 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: add stubs for ofreebsd32_(send|recv)msg

The upcoming change to generate freebsd32 generated files from
sys/kern/syscalls.master doesn't have a way to handle disabling
this one without disabling the non-COMPAT counterpart so just add
a stub for now.

Reviewed by: kevans


# e3e811a3 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: add feed forward clock syscalls

These are required when supporting i386 because time_t is 32-bit which
reduces struct bintime to 12-bytes when combined with the fact that 64-bit
integers only require 32-bit alignment on i386. Reusing the default
ABI version resulted in 4-byte overreads or overwrites to userspace.

Reviewed by: kevans


# 25fec55b 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: remove freebsd11_freebsd32_getdents

It's exactly the same as freebsd11_getdents.

Reviewed by: kevans


# 1de34945 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: remove redundant osig*() implementations

ofreebsd32_sigprocmask, ofreebsd32_sigblock, ofreebsd32_sigsetmask,
and ofreebsd32_sigsuspend were all duplicates of the default ABI
versions and there are no type concerns as all arguments are the
same.

Reviewed by: kevans


# dbb47e92 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: remove freebsd32_recvfrom

The freebsd32_recvfrom() serves no purpose as no arguments require
translation. The prototype was mis-declared and the implementation
contained (relatively harmless) errors.

Reviewed by: kevans


# ad582667 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: remove redundant no-arg syscalls

pipe requires no special handling.

ofreebsd32_sigpending did differ from osigpending in that it acted
on the siglist rather than the sigqueue, but this appears to be an
oversight in 3fbdb3c21524d9d95278ada1d61b4d1e6bee654b.

ogetpagesize could theoretically have ABI-dependent results, but in
practice does not. If it does it would be easy to handle in the central
implementation and be the least of the problems in changing the value of
PAGE_SIZE.

Reviewed by: kevans


# ab3ccb75 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: rename fstat() stat buffer argument

Reviewed by: kevans


# b35c2bca 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: rename struct wrusage32 to struct __wrusage32

This matches struct __wrusage

Reviewed by: kevans


# f1a14110 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: rename fstat argument to match default abi

Reviewed by: kevans


# 5d0d6869 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: struct kld32_file_stat -> struct kld_file_stat32

Follow common convention and put the `32` on the end of the struct
name. This is a step toward generating freebsd32 syscall files
from sys/kern/syscalls.master.

Reviewed by: kevans


# 2e89f95d 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: fix types on statfs syscalls

Rename struct statfs32 to struct ostatfs32 to mirror struct ostatfs.
These structs are used for COMPAT4 support. Stop using struct statfs32
for modern implementations as struct statfs uses fixed-width types
and is the same on all architectures.

Reviewed by: kevans


# a944d28d 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: sprinkle in missing consts

A number of syscalls have missing consts on their arguments relative to
the default syscalls.master.

Also, use timespec32 and timeval32 where appropriate.

No functional change.

Reviewed by: kevans


# 01ce7fca 15-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

ommap: fix signed len and pos arguments

4.3 BSD's mmap took an int len and long pos. Reject negative lengths
and in freebsd32 sign-extend pos correctly rather than mis-handling
negative positions as large positive ones.

Reviewed by: kib


# 8e4a3add 15-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

struct kevent_freebsd11 -> struct freebsd11_kevent

Rename to match the naming of syscalls and allow 32 to be appended
without making an ugly name like kevent_freebsd1132.

While here, make the kevent changelist argument const.

Reviewed by: kib


# fea1a98e 19-Sep-2021 Mark Johnston <markj@FreeBSD.org>

freebsd32: Fix a double copyin in sendmsg() and recvmsg()

freebsd32_sendmsg() and freebsd32_recvmsg() both copyin the message
header twice, once directly and once in freebsd32_copyinmsghdr(). The
iovec length from the former is used when copying in msg_iov, but the
rest of the kernel uses the iovec length from the latter. When
kern_sendit() and kern_recvit() iterate over the iovec to compute the
residual for I/O, they can therefore end up walking past the end of the
copied in iovec, either resulting in a system call error, userspace
memory corruption from uiomove() with invalid iovecs, or a kernel page
fault if the copied-in iovec is followed by an unmapped KVA region.

Reported by: syzbot+7cc64cd0c49605acd421@syzkaller.appspotmail.com
Reviewed by: kib, emaste
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32010


# 4bda16ff 19-Sep-2021 Mark Johnston <markj@FreeBSD.org>

freebsd32: Provide an ANSI definition for freebsd32_recvmsg()

Fix style in the freebsd32_sendmsg() definition.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 796a8e1a 01-Sep-2021 Konstantin Belousov <kib@FreeBSD.org>

procctl(2): Add PROC_WXMAP_CTL/STATUS

It allows overriding kern.elf{32,64}.allow_wx on a per-process basis.
In particular, it makes it possible to run binaries without PT_GNU_STACK
and without elfctl note while allow_wx = 0.

Reviewed by: brooks, emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31779


# f575573c 15-Sep-2021 Konstantin Belousov <kib@FreeBSD.org>

Remove PT_GET_SC_ARGS_ALL

Reimplement bdf0f24bb16d556a5b by checking for the caller's ABI in
the implementation of PT_GET_SC_ARGS, and copying out everything if
it is Linuxolator.

Also fix a minor information leak: if PT_GET_SC_ARGS_ALL is done on a
thread reused from another process, it allows reading some of that
thread's last syscall arguments. Clear td_sa.args in thread_alloc().

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D31968


# bdf0f24b 12-Sep-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

linux: implement PTRACE_GET_SYSCALL_INFO

This is one of the pieces required to make modern (i.e., Focal)
strace(1) work.

Reviewed By: jhb (earlier version)
Sponsored by: EPSRC
Differential Revision: https://reviews.freebsd.org/D28212


# 0dc332bf 05-Aug-2021 Ka Ho Ng <khng@FreeBSD.org>

Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).

fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI for kmods' use.

The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.

fspacectl(2) takes cmd and flags parameters to specify the
space management operation to be performed. Currently cmd has to be
SPACECTL_DEALLOC, and flags has to be 0.

fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28347


# 7cf06e07 28-Jul-2021 Dmitry Chagin <dchagin@FreeBSD.org>

freebsd32: Remove the unnecessary spaces.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31247
MFC after: 2 weeks


# 3c886cb6 28-Jul-2021 Dmitry Chagin <dchagin@FreeBSD.org>

freebsd32: Remove unused umtx.h include.

Differential Revision: https://reviews.freebsd.org/D31246
MFC after: 2 weeks


# db8d680e 01-Jul-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

procctl(2): add PROC_NO_NEW_PRIVS_CTL, PROC_NO_NEW_PRIVS_STATUS

This introduces a new, per-process flag, "NO_NEW_PRIVS", which
is inherited, preserved on exec, and cannot be cleared. The flag,
when set, makes subsequent execs ignore any SUID and SGID bits,
instead executing those binaries as if they were not set.

The main purpose of the flag is implementation of Linux
PR_SET_NO_NEW_PRIVS prctl(2), and possibly also unprivileged
chroot.

Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30939


# 87a64872 23-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Add ptrace(PT_COREDUMP)

It writes the core of a live stopped process to the file descriptor
provided as an argument.

Based on the initial version from https://reviews.freebsd.org/D29691,
submitted by Michał Górny <mgorny@gentoo.org>.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29955


# 7a1591c1 22-Jan-2021 Brooks Davis <brooks@FreeBSD.org>

Rename kern_mmap_req to kern_mmap

Replace all uses of kern_mmap with kern_mmap_req, remove the old kern_mmap,
and rename kern_mmap_req to kern_mmap.

The helper saved some code churn initially, but having multiple
interfaces is sub-optimal.

Obtained from: CheriBSD
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28292


# 6b3a9a0f 11-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

Convert remaining cap_rights_init users to cap_rights_init_one

semantic patch:

@@

expression rights, r;

@@

- cap_rights_init(&rights, r)
+ cap_rights_init_one(&rights, r)


# 022ca2fc 02-Jan-2021 Alan Somers <asomers@FreeBSD.org>

Add aio_writev and aio_readv

POSIX AIO is great, but it lacks vectored I/O functions. This commit
fixes that shortcoming by adding aio_writev and aio_readv. They aren't
part of the standard, but they're an obvious extension. They work just
like their synchronous equivalents pwritev and preadv.

It isn't yet possible to use vectored aiocbs with lio_listio, but that
could be added in the future.

Reviewed by: jhb, kib, bcr
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D27743


# 673e2dd6 18-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

Add ELF flag to disable ASLR stack gap.

Also centralize and unify checks to enable ASLR stack gap in a new
helper exec_stackgap().

PR: 239873
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 31df9c26 04-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

Fix compat32 for ntp_adjtime(2).

struct timex is not 32-bit safe, it uses longs for members.
Provide translation.

Reviewed by: brooks, cy
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27471


# 15eaec6a 21-Nov-2020 Kyle Evans <kevans@FreeBSD.org>

_umtx_op: move compat32 definitions back in

These are reasonably compact, and a future commit will blur the compat32
lines by supporting 32-bit operations with the native _umtx_op.


# 63ecb272 16-Nov-2020 Kyle Evans <kevans@FreeBSD.org>

umtx_op: reduce redundancy required for compat32

All of the compat32 variants are substantially the same, save for
copyin/copyout (mostly). Apply the same kind of technique used with kevent
here by having the syscall routines supply a umtx_copyops describing the
operations needed.

umtx_copyops carries the bare minimum needed: the sizes of timespec and
_umtx_time are used to determine whether copyout is needed in the sem2_wait
case.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27222
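
The copy-operations technique can be sketched as a table of function pointers plus sizes, one instance per ABI, passed to a shared operation body. All names below are hypothetical and the table carries only one copy routine; the real umtx_copyops structure is not reproduced here, and memcpy stands in for copyin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct timespec_native { int64_t tv_sec; int64_t tv_nsec; };
struct timespec_compat32 { int32_t tv_sec; int32_t tv_nsec; };

/* Hypothetical ops table: each ABI supplies its own copy routine and size. */
struct copyops_model {
	int	(*copyin_timeout)(const void *uaddr, struct timespec_native *ts);
	size_t	timespec_sz;
};

int
copyin_timeout_native(const void *uaddr, struct timespec_native *ts)
{
	memcpy(ts, uaddr, sizeof(*ts));
	return (0);
}

int
copyin_timeout_compat32(const void *uaddr, struct timespec_native *ts)
{
	struct timespec_compat32 ts32;

	memcpy(&ts32, uaddr, sizeof(ts32));
	ts->tv_sec = ts32.tv_sec;	/* Widen each field to the native type. */
	ts->tv_nsec = ts32.tv_nsec;
	return (0);
}

const struct copyops_model native_ops_model = {
	copyin_timeout_native, sizeof(struct timespec_native)
};
const struct copyops_model compat32_ops_model = {
	copyin_timeout_compat32, sizeof(struct timespec_compat32)
};

/* Shared operation body: ABI differences are confined to *ops. */
int
fetch_timeout_model(const struct copyops_model *ops, const void *uaddr,
    struct timespec_native *ts)
{
	assert(ops != NULL && uaddr != NULL && ts != NULL);
	return (ops->copyin_timeout(uaddr, ts));
}
```

Each syscall entry point then differs only in which table it passes, so the operation logic is written once.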


# 275c821d 24-Oct-2020 Kyle Evans <kevans@FreeBSD.org>

audit: correct reporting of *execve(2) success

r326145 corrected do_execve() to return EJUSTRETURN upon success so that
important registers are not clobbered. This had the side effect of tapping
out 'failures' for all *execve(2) audit records, which is less than useful
for auditing purposes.

Audit exec returns earlier, where we can know for sure that EJUSTRETURN
translates to success. Note that this unsets TDP_AUDITREC as we commit the
audit record, so the usual audit in the syscall return path will do nothing.

PR: 249179
Reported by: Eirik Oeverby <ltning-freebsd anduin net>
Reviewed by: csjp, kib
MFC after: 1 week
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26922


# aaf78c16 23-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Do not leak oldvmspace if image activation failed

and current address space is already destroyed, so kern_execve()
terminates the process.

While there, clean up some internals of post_execve() inlined in init_main.

Reported by: Peter <pmc@citylink.dinoex.sub.org>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26525


# 1a180032 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

compat: clean up empty lines in .c and .h files


# 1b1428dc 05-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Fix a TOCTOU vulnerability in freebsd32_copyin_control().

PR: 248257
Reported by: m00nbsd working with Trend Micro Zero Day Initiative
Reviewed by: kib
Security: SA-20:23.sendmsg
Security: CVE-2020-7460
Security: ZDI-CAN-11543


# 58b552dc 09-Jun-2020 John Baldwin <jhb@FreeBSD.org>

Refactor ptrace() ABI compatibility.

Add a freebsd32_ptrace() and move as many freebsd32 shims as possible
to freebsd32_ptrace(). Aside from register sets, freebsd32 passes
pointers to native structures to kern_ptrace() and converts to/from
native/32-bit structure formats in freebsd32_ptrace() outside of
kern_ptrace().

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D25195


# b24e6ac8 16-Apr-2020 Brooks Davis <brooks@FreeBSD.org>

Convert canary, execpathp, and pagesizes to pointers.

Use AUXARGS_ENTRY_PTR to export these pointers. This is a followup to
r359987 and r359988.

Reviewed by: jhb
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24446


# 9df1c38b 15-Apr-2020 Brooks Davis <brooks@FreeBSD.org>

Export argc, argv, envc, envv, and ps_strings in auxargs.

This simplifies discovery of these values, potentially with reducing the
number of syscalls we need to make at runtime. Longer term, we wish to
convert the startup process to pass an auxargs pointer to _start() and
use that rather than walking off the end of envv. This is cleaner,
more C-friendly, and for systems with strong bounds (e.g. CHERI)
necessary.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24407


# 397df744 15-Apr-2020 Brooks Davis <brooks@FreeBSD.org>

Make ps_strings in struct image_params into a pointer.

This is a preparatory commit for D24407.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA


# 618a20d4 14-Apr-2020 Brooks Davis <brooks@FreeBSD.org>

Remove bogus use of useracc() in (clock_)nanosleep.

There's no point in pre-checking that we can access the user's rmtp
pointer before we do it in copyout().

While here, improve style(9) compliance.

Reviewed by: imp
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24409


# d8010b11 09-Dec-2019 John Baldwin <jhb@FreeBSD.org>

Copy out aux args after the argument and environment vectors.

Partially revert r354741 and r354754 and go back to allocating a
fixed-size chunk of stack space for the auxiliary vector. Keep
sv_copyout_auxargs but change it to accept the address at the end of
the environment vector as an input stack address and no longer
allocate room on the stack. It is now called at the end of
copyout_strings after the argv and environment vectors have been
copied out.

This should fix a regression in r354754 that broke the stack alignment
for newer Linux amd64 binaries (and probably broke Linux arm64 as
well).

Reviewed by: kib
Tested on: amd64 (native, linux64 (only linux-base-c7), and i386)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22695


# 31174518 03-Dec-2019 John Baldwin <jhb@FreeBSD.org>

Use uintptr_t instead of register_t * for the stack base.

- Use ustringp for the location of the argv and environment strings
and allow destp to travel further down the stack for the stackgap
and auxv regions.
- Update the Linux copyout_strings variants to move destp down the
stack as was done for the native ABIs in r263349.
- Stop allocating a space for a stack gap in the Linux ABIs. This
used to hold translated system call arguments, but hasn't been used
since r159992.

Reviewed by: kib
Tested on: amd64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22501


# 03b0d68c 18-Nov-2019 John Baldwin <jhb@FreeBSD.org>

Check for errors from copyout() and suword*() in sv_copyout_args/strings.

Reviewed by: brooks, kib
Tested on: amd64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22401


# e3532331 15-Nov-2019 John Baldwin <jhb@FreeBSD.org>

Add a sv_copyout_auxargs() hook in sysentvec.

Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv
instead of doing it in the sv_fixup hook. In particular, this new
hook allows the stack space to be allocated at the same time the auxv
values are copied out to userland. This allows us to avoid wasting
space for unused auxv entries as well as not having to recalculate
where the auxv vector is by walking back up over the argv and
environment vectors.

Reviewed by: brooks, emaste
Tested on: amd64 (amd64 and i386 binaries), i386, mips, mips64
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22355


# fe69291f 03-Sep-2019 Konstantin Belousov <kib@FreeBSD.org>

Add procctl(PROC_STACKGAP_CTL)

It allows a process to request that the stack gap not be applied to its
stacks, retroactively. It is also possible to control the gaps in the
process after exec.

PR: 239894
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21352


# d05b53e0 02-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

Add sysctlbyname system call

Previously userspace would issue one syscall to resolve the sysctl and then
another one to actually use it. Do it all in one trip.

Fallback is provided in case newer libc happens to be running on an older
kernel.

Submitted by: Pawel Biernacki
Reported by: kib, brooks
Differential Revision: https://reviews.freebsd.org/D17282


# fc83c5a7 31-Jul-2019 Konstantin Belousov <kib@FreeBSD.org>

Make randomized stack gap between strings and pointers to argv/envs.

This effectively makes the stack base on the csu _start entry
randomized.

The gap is enabled if ASLR for the ABI is enabled, and then
kern.elf{64,32}.aslr.stack_gap specifies the max percentage of the
initial stack size that can be wasted for gap. Setting it to zero
disables the gap, and max is capped at 50%.

Only amd64 for now.

Reviewed by: cem, markj
Discussed with: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21081
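
As a rough illustration of the percentage cap described in this commit
message, the C sketch below computes an upper bound for the gap. The
function name stack_gap_limit and its parameters are hypothetical and are
not taken from the kernel sources; the real gap would be a random value
below such a limit.

```c
#include <stdint.h>

/*
 * Hypothetical helper: upper bound on the stack gap, in bytes, for a given
 * initial stack size and configured percentage (the value described above
 * for kern.elf{64,32}.aslr.stack_gap).
 */
static uint64_t
stack_gap_limit(uint64_t stack_size, unsigned int pct)
{
	if (pct == 0)
		return (0);		/* zero disables the gap */
	if (pct > 50)
		pct = 50;		/* the maximum is capped at 50% */
	/* Divide first so that very large stack sizes cannot overflow. */
	return (stack_size / 100 * pct);
}
```

Dividing before multiplying trades a little precision for overflow
safety, which seems a reasonable choice for a bound of this kind.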


# 5dc7e31a 02-Jul-2019 Konstantin Belousov <kib@FreeBSD.org>

Control implicit PROT_MAX() using procctl(2) and the FreeBSD note
feature bit.

In particular, allocate the bit to opt the image out of implicit
PROTMAX enablement. Provide procctl(2) verbs to set and query
implicit PROTMAX handling. The knobs mimic the same per-image flag
and per-process controls for ASLR.

Reviewed by: emaste, markj (previous version)
Discussed with: brooks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D20795


# f0645b3a 30-Mar-2019 Jason A. Harmening <jah@FreeBSD.org>

freebsd32: fix padding of computed control message length for recvmsg()

Each control message region must be aligned on a 4-byte boundary on 32-bit
architectures. The 32-bit compat shim for recvmsg() gets the actual layout
right, but doesn't pad the payload length when computing msg_controllen for
the output message header. If a control message contains an unaligned
payload, such as the 1-byte TTL field in the example attached to PR 236737,
this can produce control message payload boundaries that extend beyond
the boundary reported by msg_controllen.

PR: 236737
Reported by: Yuval Pavel Zholkover <paulzhol@gmail.com>
Reviewed by: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19768
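
The padding rule in this commit message can be sketched in C as follows.
This is only an illustration under stated assumptions: the helper names
align4 and cmsg_space32 are hypothetical, and the 12-byte header size used
in the note below is an assumption about the 32-bit control message header,
not something stated in the commit.

```c
#include <stddef.h>

/* Round a length up to the next multiple of 4 (32-bit cmsg alignment). */
static size_t
align4(size_t len)
{
	return ((len + 3) & ~(size_t)3);
}

/*
 * Hypothetical helper: space occupied in the control buffer by one message
 * with a header of hdrlen bytes and a payload of datalen bytes, with both
 * parts padded to a 4-byte boundary.
 */
static size_t
cmsg_space32(size_t hdrlen, size_t datalen)
{
	return (align4(hdrlen) + align4(datalen));
}
```

With an assumed 12-byte header, a 1-byte TTL payload occupies 16 bytes
rather than 13, and it is the padded figure that has to be summed into
msg_controllen.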


# fd8d844f 16-Mar-2019 Konstantin Belousov <kib@FreeBSD.org>

amd64 KPTI: add control from procctl(2).

Add the infrastructure to allow MD procctl(2) commands, and use it to
introduce amd64 PTI control and reporting. PTI mode cannot be
modified for existing pmap, the knob controls PTI of the new vmspace
created on exec.

Requested by: jhb
Reviewed by: jhb, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514


# 329f0aa9 11-Mar-2019 Warner Losh <imp@FreeBSD.org>

Kill tz_minuteswest and tz_dsttime.

Research Unix, 7th Edition introduced TIMEZONE and DSTFLAG
compile-time constants in sys/param.h to communicate these values for
the machine. 4.2BSD moved from compile-time to run-time and
introduced these variables, which localtime() used to return the
right offset from UTC (sometimes referred to as GMT, which for this
purpose is the same). 4.4BSD migrated to using the tzdata code/database and
these variables were basically unused.

FreeBSD removed the real need for these with adjkerntz in
1995. However, some RTC clocks continued to use these variables,
though they were largely unused otherwise. Later, phk centralized
most of the uses in utc_offset, but left it using both tz_minuteswest
and adjkerntz.

POSIX (IEEE Std 1003.1-2017) states in the gettimeofday specification
"If tzp is not a null pointer, the behavior is unspecified" so there's
no standards reason to retain it anymore. In fact, gettimeofday has
been marked as obsolescent, meaning it could be removed from a future
release of the standard. It is the only interface defined in POSIX
that references these two values. All other references come from the
tzdata database via tzset().

These were used to more faithfully implement early unix ABIs which
have been removed from FreeBSD. NetBSD completely eliminated
these variables years ago. Linux has migrated to tzdata as well,
though these variables technically still exist for compatibility
with unspecified older programs.

So, there's no real reason to have them these days. They are a
historical vestige that's no longer used in any meaningful way.

Reviewed By: jhb@, brooks@
Differential Revision: https://reviews.freebsd.org/D19550


# fa50a355 10-Feb-2019 Konstantin Belousov <kib@FreeBSD.org>

Implement Address Space Layout Randomization (ASLR)

With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.

Although the value of ASLR is diminishing over time as exploit authors
work out simple ASLR bypass techniques, it eliminates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintenance burden.

The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.

I have not fine-tuned the amount of entropy injected right now. It is
only a quantitative change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.

To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.

The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.

ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.

Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.

PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603


# a7f67fac 08-Feb-2019 Konstantin Belousov <kib@FreeBSD.org>

Normalize the declaration of i386_read_exec variable.

It is currently re-declared in sys/sysent.h which is a wrong place for
MD variable. Which causes redeclaration error with gcc when
sys/sysent.h and machine/md_var.h are included both.

Remove it from sys/sysent.h and instead include machine/md_var.h when
needed, under #ifdef for both i386 and amd64.

Reported and tested by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# f373437a 29-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Add helper functions to copy strings into struct image_args.

Given a zeroed struct image_args with an allocated buf member,
exec_args_add_fname() must be called to install a file name (or NULL).
Then zero or more calls to exec_args_add_arg() followed by zero or
more calls to exec_args_add_env(). exec_args_adjust_args() may be
called after args and/or env to allow an interpreter to be prepended to
the argument list.

To allow code reuse when adding arg and env variables, begin_envv
should be accessed with the accessor exec_args_get_begin_envv()
which handles the case when no environment entries have been added.

Use these functions to simplify exec_copyin_args() and
freebsd32_exec_copyin_args().

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15468


# 9a38df59 09-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Fix freebsd32 mknod(at).

As dev_t is now a 64-bit integer, it requires special handling as a
system call argument. 64-bit arguments are split between two 64-bit
integers due to the way arguments are promoted to allow reuse of most
system call implementations. They must be reassembled before use.
Further, 64-bit arguments at an odd offset (counting from zero) are
padded and slid to the next slot on powerpc and mips. Fix the
non-COMPAT11 system call by adding a freebsd32_mknodat() and
appropriately padded declarations.

The COMPAT11 system calls are fully compatible with the 64-bit
implementations so remove the freebsd32_ versions.

Use uint32_t consistently as the type of the old dev_t. This matches
the old definition.

Reviewed by: kib
MFC after: 3 days
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17928
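
The reassembly step mentioned in this commit message can be sketched in C
as below. The function name join32 is hypothetical, and which argument slot
carries the low half is an assumption here; in practice that depends on the
byte order of the 32-bit ABI.

```c
#include <stdint.h>

/*
 * Hypothetical helper: rebuild a 64-bit value (such as a dev_t) from the
 * two 32-bit halves that arrive in separate system call argument slots.
 */
static uint64_t
join32(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)lo | ((uint64_t)hi << 32));
}
```

Casting each half to uint64_t before shifting avoids shifting a 32-bit
value by its full width, which would be undefined behavior in C.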


# 40747517 08-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Make __sysctl follow the freebsd32_foo convention.

Sponsored by: DARPA, AFRL


# 12e69f96 02-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Add const to input-only char * arguments.

These arguments are mostly paths handled by NAMEI*() macros which already
take const char * arguments.

This change improves the match between syscalls.master and the public
declarations of system calls.

Reviewed by: kib (prior version)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17812


# c542c43e 16-Aug-2018 Jamie Gritton <jamie@FreeBSD.org>

Revert r337922, except for some documentation-only bits. This needs to wait
until user is changed to stop using jail(2).

Differential Revision: D14791


# 284001a2 16-Aug-2018 Jamie Gritton <jamie@FreeBSD.org>

Put jail(2) under COMPAT_FREEBSD11. It has been the "old" way of creating
jails since FreeBSD 7.

Along with the system call, put the various security.jail.allow_foo and
security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or
BURN_BRIDGES). These sysctls had two disparate uses: on the system side,
they were global permissions for jails created via jail(2) which lacked
fine-grained permission controls; inside a jail, they're read-only
descriptions of what the current jail is allowed to do. The first use
is obsolete along with jail(2), but keep them for the second, read-only use.

Differential Revision: D14791


# c7902fbe 07-Aug-2018 Mark Johnston <markj@FreeBSD.org>

Improve handling of control message truncation.

If a recvmsg(2) or recvmmsg(2) caller doesn't provide sufficient space
for all control messages, the kernel sets MSG_CTRUNC in the message
flags to indicate truncation of the control messages. In the case
of SCM_RIGHTS messages, however, we were failing to dispose of the
rights that had already been externalized into the recipient's file
descriptor table. Add a new function and mbuf type to handle this
cleanup task, and use it any time we fail to copy control messages
out to the recipient. To simplify cleanup, control message truncation
is now only performed at control message boundaries.

The change also fixes a few related bugs:
- Rights could be leaked to the recipient process if an error occurred
while copying out a message's contents.
- We failed to set MSG_CTRUNC if the truncation occurred on a control
message boundary, e.g., if the caller received two control messages
and provided only the exact amount of buffer space needed for the
first.

PR: 131876
Reviewed by: ed (previous version)
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16561


# 3de1e9aa 31-Jul-2018 Konstantin Belousov <kib@FreeBSD.org>

Provide compat32 shims for sched_rr_get_interval(2).

The interface uses struct timespec, which needs a translation.

Reported and reviewed by: asomers
PR: 230175
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16525


# 9dea3ac8 29-Jul-2018 Alan Somers <asomers@FreeBSD.org>

freebsd32_getrusage(2): skip freebsd32_rusage_out on error

PR: 230153
Reported by: kib
MFC after: 2 weeks
X-MFC-With: 336871
Differential Revision: https://reviews.freebsd.org/D16500


# 5cf35a10 29-Jul-2018 Alan Somers <asomers@FreeBSD.org>

getrusage(2): fix return value under 32-bit emulation

According to the man page, getrusage(2) should return EFAULT if the rusage
argument lies outside of the process's address space. But due to an
oversight in r100384, that's never been the case during 32-bit emulation.
Fix it.

PR: 230153
Reported by: tests(7)
Reviewed by: cem
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16500


# ab35e1c7 12-Jun-2018 Bruce Evans <bde@FreeBSD.org>

Fix the encoding of major and minor numbers in 64-bit dev_t by restoring
the old encodings for the lower 16 and 32 bits and only using the
higher 32 bits for unusually large major and minor numbers. This
change breaks compatibility with the previous encoding (which was only
used in -current).

Fix truncation to (essentially) 16-bit dev_t in newnfs v3.

Any encoding of device numbers gives an ABI, so it can't be changed
without translations for compatibility. Extra bits give the much
larger complication that the translations need to compress into fewer
bits. Fortunately, more than 32 bits are rarely needed, so
compression is rarely needed except for 16-bit linux dev_t where it
was always needed but never done.

The previous encoding moved the major number into the top 32 bits.
Almost no translation code handled this, so the major number was blindly
truncated away in most 32-bit encodings. E.g., for ffs, mknod(8) with
major = 1 and minor = 2 gave dev_t = 0x10000002; ffs cannot represent
this and blindly truncated it to 2. But if this mknod was run on any
released version of FreeBSD, it gives dev_t = 0x102. ffs can represent
this, but in the previous encoding it was not decoded, giving major = 0,
minor = 0x102.

The presence of bugs was most obvious for exporting dev_t's from an
old system to -current, since bugs in newnfs augment them. I fixed
oldnfs to support 32-bit dev_t in 1996 (r16634), but this regressed
to 16-bit dev_t in newnfs, first to the old 16-bit encoding and then
further in -current. E.g., old ad0 with major = 234, minor = 0x10002
had the correct (major, minor) number on the wire, but newnfs truncated
this to (234, 2) and then the previous encoding shifted the major
number into oblivion as seen by ffs or old applications.

I first tried to fix this by translating on every ABI/API boundary, but
there are too many boundaries and too many sloppy translations by blind
truncation. So use the old encoding for the low 32 bits so that sloppy
translations work no worse than before provided the high 32 bits are
not set. Add some error checking for when bits are lost. Keep not
doing any error checking for translations for almost everything in
compat/linux.

compat/freebsd32/freebsd32_misc.c:
Optionally check for losing bits after possibly-truncating assignments as
before.

compat/linux/linux_stats.c:
Depend on the representation being compatible with Linux's (or just with
itself for local use) and spell some of the translations as assignments in
a macro that hides the details.

fs/nfsclient/nfs_clcomsubs.c:
Essentially the same fix as in 1996, except there is now no possible
truncation in makedev() itself. Also fix nearby style bugs.

kern/vfs_syscalls.c:
As for freebsd32. Also update the sysctl description to include file
numbers, and change it to describe device ids as device numbers.

sys/types.h:
Use inline functions (wrapped by macros) since the expressions are now
a bit too complicated for plain macros. Describe the encoding and
some of the reasons for it. 16-bit compatibility didn't leave many
reasonable choices for the 32-bit encoding, and 32-bit compatibility
doesn't leave many reasonable choices for the 64-bit encoding. My
choice is to put the 8 new minor bits in the low 8 bits of the top 32
bits. This minimizes discontiguities.

Reviewed by: kib (except for rewrite of the comment in linux_stats.c)
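
The encoding described in this commit message can be sketched in C as
follows. This is an illustration, not the sys/types.h implementation: the
sketch_* names are hypothetical, and placing the high major bits in the
remaining top 24 bits is an assumption that follows from the description
(old layout in the low 32 bits, 8 new minor bits in the low 8 bits of the
top 32 bits).

```c
#include <stdint.h>

/* Pack a (major, minor) pair into a 64-bit device number. */
static uint64_t
sketch_makedev(uint32_t major, uint32_t minor)
{
	return (((uint64_t)(major & 0xffffff00) << 32) |	/* high major bits */
	    ((uint64_t)(major & 0xff) << 8) |			/* old major byte */
	    ((uint64_t)(minor & 0xff00) << 24) |		/* 8 new minor bits */
	    (uint64_t)(minor & 0xffff00ff));			/* old minor bits */
}

/* Extract the major number. */
static uint32_t
sketch_major(uint64_t d)
{
	return ((uint32_t)(((d >> 32) & 0xffffff00) | ((d >> 8) & 0xff)));
}

/* Extract the minor number. */
static uint32_t
sketch_minor(uint64_t d)
{
	return ((uint32_t)(((d >> 24) & 0xff00) | (d & 0xffff00ff)));
}
```

With this layout, major = 1 and minor = 2 pack to 0x102, matching the
value the message gives for released versions, and a blind truncation to
32 bits loses nothing unless the high 32 bits are set.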


# 372639f9 13-Jun-2018 Bruce Evans <bde@FreeBSD.org>

Fix some bugs found while fixing the representation and translation
of 64-bit dev_t's (but not ones involving dev_t's).

st_size was supposed to be clamped in cvtstat() and linux's copy_stat(),
but the clamping code wasn't aware that st_size is signed, and also had
an obfuscated off-by-1 value for the unsigned limit, so its effect was
to produce a bizarre negative size instead of clamping.

Change freebsd32's copy_ostat() to be no worse than cvtstat(). It was
missing clamping and bzero()ing of padding.

Reviewed by: kib (except a final fix of the clamp to the signed maximum)
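
A minimal C sketch of clamping to the signed maximum, as the review note
above describes, might look like this. The function name clamp_size32 is
hypothetical, and the sketch assumes a non-negative input size; it is not
the code from cvtstat() or copy_ostat().

```c
#include <stdint.h>

/*
 * Hypothetical helper: fit a signed 64-bit file size into a signed 32-bit
 * field, clamping to INT32_MAX instead of wrapping to a negative value.
 */
static int32_t
clamp_size32(int64_t size)
{
	if (size > INT32_MAX)
		return (INT32_MAX);
	return ((int32_t)size);
}
```

Comparing against the signed limit matters because clamping against an
unsigned limit would still let values above INT32_MAX through and turn
them negative on assignment.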


# e15f0023 15-May-2018 Brooks Davis <brooks@FreeBSD.org>

Allow freebsd32 __sysctl(2) to return ENOMEM.

This is required by programs like sockstat that read variably sized
sysctls such as kern.file. The normal path has no such restriction and
the restriction was added without comment along with initial support for
freebsd32 in 2002 (r100384).

Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15438


# 1302eea7 20-Apr-2018 Konstantin Belousov <kib@FreeBSD.org>

Rename PROC_PDEATHSIG_SET -> PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_GET
-> PROC_PDEATHSIG_STATUS for consistency with other procctl(2)
operations names.

Requested by: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 13 days


# 73c8686e 19-Apr-2018 John Baldwin <jhb@FreeBSD.org>

Simplify the code to allocate stack for auxv, argv[], and environment vectors.

Remove auxarg_size as it was only used once right after a confusing
assignment in each of the variants of exec_copyout_strings().

Reviewed by: emaste
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D15123


# b9408863 18-Apr-2018 Konstantin Belousov <kib@FreeBSD.org>

Add PROC_PDEATHSIG_SET to procctl interface.

Allow processes to request the delivery of a signal upon death of
their parent process. Supposed consumer of the feature is PostgreSQL.

Submitted by: Thomas Munro
Reviewed by: jilles, mjg
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D15106


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compatibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# fb441a88 27-Mar-2018 Konstantin Belousov <kib@FreeBSD.org>

Fix several leaks of kernel stack data through paddings.

It is a random collection of fixes for issues not yet corrected,
reported at https://tsyrklevi.ch/clang_analyzer/freebsd_013017/. Many
issues from that list were already corrected. Most of them are for
compat32, old compat32 or affect both primary host ABI and compat32.

The freebsd32_kldstat(), for instance, was already fixed by using
malloc(M_ZERO). Patch includes correction to report the supplied
version back, which is just pedantic.

Reviewed by: brooks, emaste (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14868


# b81e88d2 20-Feb-2018 Brooks Davis <brooks@FreeBSD.org>

Reduce duplication in dynamic syscall registration code.

Remove the unused syscall_(de)register() functions in favor of the
better documented and easier to use syscall_helper_(un)register(9)
functions.

The default and freebsd32 versions differed in which array of struct
sysents they used and a few missing updates to the 32-bit code as
features were added to the main code.

Reviewed by: cem
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14337


# a4dcd0ef 15-Feb-2018 Brooks Davis <brooks@FreeBSD.org>

Remove freebsd32_getdirentries(), it will be unused after the next
commit.


# 3f289c3f 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Clean up some header pollution created by having seq.h in proc.h.

Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403


# 7f2d13d6 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/compat: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.


# ffb66079 24-Nov-2017 John Baldwin <jhb@FreeBSD.org>

Decode kevent structures logged via ktrace(2) in kdump.

- Add a new KTR_STRUCT_ARRAY ktrace record type which dumps an array of
structures.

The structure name in the record payload is preceded by a size_t
containing the size of the individual structures. Use this to
replace the previous code that dumped the kevent arrays for
kevent(). kdump is now able to decode the kevent structures rather
than dumping their contents via a hexdump.

One change from before is that the 'changes' and 'events' arrays are
not marked with separate 'read' and 'write' annotations in kdump
output. Instead, the first array is the 'changes' array, and the
second array (only present if kevent doesn't fail with an error) is
the 'events' array. For kevent(), empty arrays are denoted by an
entry with an array containing zero entries rather than no record.

- Move kevent decoding tables from truss to libsysdecode.

This adds three new functions to decode members of struct kevent:
sysdecode_kevent_filter, sysdecode_kevent_flags, and
sysdecode_kevent_fflags.

kdump uses these helper functions to pretty-print kevent fields.

- Move structure definitions for freebsd11 and freebsd32 kevent
structures to <sys/event.h> so that they can be shared with userland.
The 32-bit structures are only exposed if _WANT_KEVENT32 is defined.
The freebsd11 structures are only exposed if _WANT_FREEBSD11_KEVENT is
defined. The 32-bit freebsd11 structure requires both.

- Decode freebsd11 kevent structures in truss for the compat11.kevent()
system call.

- Log 32-bit kevent structures via ktrace for 32-bit compat kevent()
system calls.

- While here, constify the 'void *data' argument to ktrstruct().

Reviewed by: kib (earlier version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12470


# edb01d11 15-Nov-2017 Gordon Tetlow <gordon@FreeBSD.org>

Properly bzero kldstat structure to prevent kernel information leak.

Submitted by: kib
Reported by: TJ Corley
Security: CVE-2017-1088


# afbd12c1 06-Sep-2017 Maxim Sobolev <sobomax@FreeBSD.org>

In the recvmsg32() system call iterate over returned structure(s)
and convert any messages of types SCM_BINTIME, SCM_TIMESTAMP,
SCM_REALTIME and SCM_MONOTONIC from 64-bit to its 32-bit
representation. Otherwise we either run out of user-supplied
buffer to copy those out resulting in the MSG_CTRUNC or simply
return values that the userland 32-bit code is not going
to parse correctly. This fixes at least two regression tests
failing to function properly in 32-bit compat mode:

tools/regression/sockets/udp_pingpong
tools/regression/sockets/unix_cmsg

PR: kern/222039
MFC after: 30 days
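
The kind of conversion described in this commit message can be sketched in
C for the SCM_TIMESTAMP case. The structure and function names below
(tv64, tv32, tv64_to_tv32) are hypothetical stand-ins rather than the
kernel's own types, and the field widths are assumptions for an
amd64 host with an i386 guest.

```c
#include <stdint.h>

/* Stand-in for the native 64-bit timestamp layout (16 bytes). */
struct tv64 {
	int64_t	tv_sec;
	int64_t	tv_usec;
};

/* Stand-in for the layout a 32-bit process expects (8 bytes). */
struct tv32 {
	int32_t	tv_sec;
	int32_t	tv_usec;
};

/* Narrow each field; the result is half the size of the input. */
static struct tv32
tv64_to_tv32(struct tv64 in)
{
	struct tv32 out;

	out.tv_sec = (int32_t)in.tv_sec;
	out.tv_usec = (int32_t)in.tv_usec;
	return (out);
}
```

Because the converted payload is smaller, a buffer sized by a 32-bit
caller can hold it, which is why the unconverted form could lead to
MSG_CTRUNC.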


# aef2a6a7 01-Jul-2017 Konstantin Belousov <kib@FreeBSD.org>

Port PowerPC kqueue(2) compat32 fix in r320500 to MIPS.

All 32bit MIPS ABIs align uint64_t on 8-byte. Since struct kevent32
is defined using 32bit types to avoid extra alignment on amd64/i386,
layout of the structure needs paddings on PowerPC and apparently MIPS.

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D11434


# cfb2d93b 30-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Amend the layout of kevent32 on powerpc where uint64_t has 8-byte
alignment.

Reported, tested and assertion updates by: andreast
Sponsored by: The FreeBSD Foundation


# b4366092 26-Jun-2017 Justin Hibbits <jhibbits@FreeBSD.org>

Update comments and simplify conditionals for compat32

Only amd64 (because of i386) needs 32-bit time_t compat now, everything else is
64-bit time_t. Rather than checking on all 64-bit time_t archs, only check the
oddball amd64/i386.

Reviewed By: emaste, kib, andrew
Differential Revision: https://reviews.freebsd.org/D11364


# fbcf7bcd 25-Jun-2017 Justin Hibbits <jhibbits@FreeBSD.org>

Solve the y2038 problem for powerpc

AKA Make time_t 64 bits on powerpc(32).

Until now, PowerPC was one of two architectures with a 32-bit time_t
on 32-bit archs (the other being i386). This is an ABI breakage, so all ports,
and all local binaries, *must* be recompiled.

Tested by: andreast, others
MFC after: Never
Relnotes: Yes


# 2b34e843 16-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Add abstime kqueue(2) timers and expand struct kevent members.

This change implements NOTE_ABSTIME flag for EVFILT_TIMER, which
specifies that the data field contains absolute time to fire the
event.

To make this useful, data member of the struct kevent must be extended
to 64bit. Using the opportunity, I also added ext members. This
changes struct kevent almost to Apple struct kevent64, except I did
not change the type of ident and udata; the latter would cause serious API
incompatibilities.

The type of ident was kept uintptr_t since EVFILT_AIO returns a
pointer in this field, and e.g. CHERI is sensitive to the type
(discussed with brooks, jhb).

Unlike Apple kevent64, symbol versioning allows us to claim ABI
compatibility and still name the new syscall kevent(2). Compat shims
are provided for both host native and compat32.

Requested by: bapt
Reviewed by: bapt, brooks, ngie (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D11025


# 7abe0df2 09-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Enhance vfs.ino64_trunc_error sysctl.

Provide a new mode "2" which returns a special overflow indicator in
the non-representable field instead of the silent truncation (mode
"0") or EOVERFLOW (mode "1").

In particular, the typical use of st_ino to detect hard links with
mode "2" reports false positives, which might be more suitable for
some uses.

Discussed with: bde
Sponsored by: The FreeBSD Foundation
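
The three modes described above can be sketched as follows. This is a
minimal illustration, not the kernel code: the function name, the enum and
the use of UINT32_MAX as the "special overflow indicator" are assumptions
made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy values mirroring the three sysctl modes. */
enum trunc_mode { TRUNC_SILENT = 0, TRUNC_ERROR = 1, TRUNC_SENTINEL = 2 };

/*
 * Narrow a 64-bit inode number to 32 bits.  Returns 0 on success or
 * EOVERFLOW when mode 1 rejects a non-representable value.
 * The UINT32_MAX sentinel for mode 2 is an assumption for illustration.
 */
static int
narrow_ino(uint64_t ino, enum trunc_mode mode, uint32_t *out)
{
	assert(out != NULL);
	if (ino <= UINT32_MAX) {
		*out = (uint32_t)ino;	/* representable: no policy needed */
		return (0);
	}
	switch (mode) {
	case TRUNC_SILENT:
		*out = (uint32_t)ino;	/* keep only the low 32 bits */
		return (0);
	case TRUNC_ERROR:
		return (EOVERFLOW);
	case TRUNC_SENTINEL:
		*out = UINT32_MAX;	/* assumed overflow indicator */
		return (0);
	}
	return (EINVAL);
}
```

With the sentinel policy, two distinct large inode numbers narrow to the
same value, which is why a hard-link check based on st_ino can report
false positives in mode "2".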


# 3df7ebc4 05-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Add sysctl vfs.ino64_trunc_error controlling action on truncating
inode number or link count for the ABI compat binaries.

Right now, and by default after the change, too large 64bit values are
silently truncated to 32 bits. Enabling the knob causes the system to
return EOVERFLOW for stat(2) family of compat syscalls when some
values cannot be completely represented by the old structures. For
getdirentries(2), the knob skips the dirents which would cause
non-trivial truncation of d_ino.

EOVERFLOW error is specified by the X/Open 1996 LFS document
('Adding Support for Arbitrary File Sizes to the Single UNIX
Specification').

Based on the discussion with: bde
Sponsored by: The FreeBSD Foundation


# 69921123 23-May-2017 Konstantin Belousov <kib@FreeBSD.org>

Commit the 64-bit inode project.

Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

The kinfo sysctl MIBs ABI is changed in a backward-compatible way, but
there is no general mechanism to handle other sysctl MIBs which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion was done by Konstantin Belousov (kib).

Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439


# f19351aa 05-May-2017 Brooks Davis <brooks@FreeBSD.org>

Provide a freebsd32 implementation of sigqueue()

The previous misuse of sys_sigqueue() was sending random register or
stack garbage to 64-bit targets. The freebsd32 implementation preserves
the sival_int member of value when signaling a 64-bit process.

Document the mixed ABI implementation of union sigval and the
incompatibility of sival_ptr with pointer integrity schemes.

Reviewed by: kib, wblock
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D10605
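
A rough sketch of why only sival_int can be preserved when a 32-bit
process signals a 64-bit one. The two union types and the function name
are hypothetical stand-ins for the 32-bit and native layouts of union
sigval; the real kernel code may differ in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-bit view of union sigval: both members are 4 bytes. */
union sigval32_sketch {
	int32_t		sival_int;
	uint32_t	sival_ptr;
};

/* Hypothetical native (LP64) view: the pointer member is 8 bytes wide. */
union sigval64_sketch {
	int32_t		sival_int;
	uint64_t	sival_ptr;
};

static union sigval64_sketch
sigval32_to_native(union sigval32_sketch in)
{
	union sigval64_sketch out;

	memset(&out, 0, sizeof(out));	/* no stray bytes reach the target */
	out.sival_int = in.sival_int;	/* the int member survives as-is */
	return (out);
}
```

Zeroing the wider union first is what avoids passing "random register or
stack garbage" in the bytes that the 32-bit value does not cover.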


# 3f8455b0 18-Mar-2017 Eric van Gyzen <vangyzen@FreeBSD.org>

Add clock_nanosleep()

Add a clock_nanosleep() syscall, as specified by POSIX.
Make nanosleep() a wrapper around it.

Attach the clock_nanosleep test from NetBSD. Adjust it for the
FreeBSD behavior of updating rmtp only when interrupted by a signal.
I believe this to be POSIX-compliant, since POSIX mentions the rmtp
parameter only in the paragraph about EINTR. This is also what
Linux does. (NetBSD updates rmtp unconditionally.)

Copy the whole nanosleep.2 man page from NetBSD because it is complete
and closely resembles the POSIX description. Edit, polish, and reword it
a bit, being sure to keep any relevant text from the FreeBSD page.

Reviewed by: kib, ngie, jilles
MFC after: 3 weeks
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D10020
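
For illustration, the relationship between the two calls can be sketched
in userland C. This is not the kernel implementation; sketch_nanosleep()
is a hypothetical name, and the only assumption is the POSIX behaviour
that clock_nanosleep() returns an error number while nanosleep() returns
-1 and sets errno.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <errno.h>
#include <time.h>

/*
 * nanosleep()-style wrapper: a relative sleep on CLOCK_REALTIME.
 * clock_nanosleep() reports failures through its return value, so the
 * wrapper translates that into the errno convention.
 */
static int
sketch_nanosleep(const struct timespec *rqtp, struct timespec *rmtp)
{
	int error;

	error = clock_nanosleep(CLOCK_REALTIME, 0, rqtp, rmtp);
	if (error != 0) {
		errno = error;
		return (-1);
	}
	return (0);
}
```

A caller should look at rmtp only after an EINTR failure, which matches
the behaviour described above of updating it only when a signal
interrupts the sleep.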


# 4cf66812 18-Mar-2017 Eric van Gyzen <vangyzen@FreeBSD.org>

nanosleep: plug a kernel memory disclosure

nanosleep() updates rmtp on EINVAL. In that case, kern_nanosleep()
has not updated rmt, so sys_nanosleep() updates the user-space rmtp
by copying garbage from its stack frame. This is not only a kernel
memory disclosure, it's also not POSIX-compliant. Fix it to update
rmtp only on EINTR.

Reviewed by: jilles (via D10020), dchagin
MFC after: 3 days
Security: possibly
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D10044


# 01feb4c3 14-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Use designated initializers for kevent_copyops.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
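
As a generic illustration of the technique (the struct and function names
below are hypothetical, not the kernel's kevent_copyops definition), a
callback table initialized with C99 designated initializers looks like
this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback table, loosely modeled on a "copy ops" struct. */
struct copyops_sketch {
	void	*arg;
	int	(*copyout)(void *arg, const int *src, int count);
	int	(*copyin)(void *arg, int *dst, int count);
};

/* Fill dst with 0..count-1; stands in for a real copyin routine. */
static int
sketch_copyin(void *arg, int *dst, int count)
{
	(void)arg;
	for (int i = 0; i < count; i++)
		dst[i] = i;
	return (0);
}

/* Members are named, so order and omitted fields no longer matter. */
static const struct copyops_sketch ops = {
	.copyin = sketch_copyin,
	.arg = NULL,
	/* .copyout is omitted and is therefore initialized to NULL */
};
```

Naming the members keeps initializers correct if fields are later added
to or reordered in the structure.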


# 496ab053 13-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Rework r313352.

Rename kern_vm_* functions to kern_*. Move the prototypes to
syscallsubr.h. Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.

Requested by: alc, jhb
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D9535


# 995b8f4f 12-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Style: wrap long line.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 69cdfcef 06-Feb-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_vm_mmap2(), kern_vm_mprotect(), kern_vm_msync(), kern_vm_munlock(),
kern_vm_munmap(), and kern_vm_madvise(), and use them in various compats
instead of their sys_*() counterparts.

Reviewed by: ed, dchagin, kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9378


# 96ee4310 05-Feb-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_cpuset_getaffinity() and kern_cpuset_setaffinity(),
and use them in compats instead of their sys_*() counterparts.

Reviewed by: kib, jhb, dchagin
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9383


# b38b22b0 31-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_pread() and kern_pwrite(), and use them in compats instead
of their sys_*() counterparts. svr4 is left unchanged.

Reviewed by: kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9379


# fc8bde8f 31-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Replace calls to sys_truncate() with kern_truncate().

Reviewed by: kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9371


# ea2ebdc1 31-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_cpuset_getid() and kern_cpuset_setid(), and use them
in compat32 instead of their sys_*() counterparts.

Reviewed by: jhb@, kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9382


# f67d6b5f 29-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_lseek() and use it instead of sys_lseek() in various compats.
I didn't touch svr4/, there's no point.

Reviewed by: ed@, kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9366


# ae6b6ef6 30-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Replace sys_ftruncate() with kern_ftruncate() in various compats.

Reviewed by: kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9368


# 2f304845 05-Jan-2017 Konstantin Belousov <kib@FreeBSD.org>

Do not allocate struct statfs on kernel stack.

Right now the size of the structure is 472 bytes on amd64, which is
already large, and stack allocations are undesirable. With the ino64
work, MNAMELEN is increased to 1024, which will make it impossible to have
struct statfs on the stack.

Extracted from: ino64 work by gleb
Discussed with: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 607fa849 05-Jan-2017 Konstantin Belousov <kib@FreeBSD.org>

Some style fixes for getfsstat(2)-related code.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 34ed0c63 27-Dec-2016 John Baldwin <jhb@FreeBSD.org>

Rename the 'flags' argument to getfsstat() to 'mode' and validate it.

This argument is not a bitmask of flags, but only accepts a single value.
Fail with EINVAL if an invalid value is passed to 'mode'. Rename the
'flags' argument to getmntinfo(3) to 'mode' as well to match.

This is a followup to r308088.

Reviewed by: kib
MFC after: 1 month


# 643f6f47 21-Sep-2016 Konstantin Belousov <kib@FreeBSD.org>

Add PROC_TRAPCAP procctl(2) controls and global sysctl kern.trap_enocap.

Both can be used to cause processes in capability mode to receive
SIGTRAP when ENOTCAPABLE or ECAPMODE errors are returned from
syscalls.

Idea by: emaste
Reviewed by: oshogbo (previous version), emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D7965


# a72c64b0 22-Jun-2016 Brooks Davis <brooks@FreeBSD.org>

Generate syscall tables and update pipe() implementation after r302094.

Mark the pipe() system call as COMPAT10.

As of r302092 libc uses pipe2() with a zero flags value instead of pipe().

Approved by: re (gjb)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D6816


# 9c64cfe5 29-Mar-2016 Gleb Smirnoff <glebius@FreeBSD.org>

sendfile(2) allows sending extra data from userspace before the file
data (headers). Historically the size of the headers was not checked
against the socket buffer space, so an application could easily
overcommit the socket buffer space.

With the new sendfile (r293439) the problem remained, but a KASSERT was
inserted that checks that the amount of data written to the socket matches
its space. When the size of the headers is bigger than the socket space,
the KASSERT fires. Without INVARIANTS the new sendfile won't panic, but
would report an incorrect number of bytes sent.

o With this change, the headers copyin is moved down into the loop, after
the sbspace() check. The uio size is trimmed by socket space there,
which fixes the overcommit problem and its consequences.
o The compatibility handling for the FreeBSD 4 sendfile headers API is pushed
up the stack to the syscall wrappers. This required a copy and paste of the
code, but in turn it allowed removing the extra stack-carried parameter
from fo_sendfile_t and enclosing the entire compat code in #ifdef. If in
the future we get more fo_sendfile_t functions, the amount of copy and
paste would even decrease.

Reviewed by: emax, gallatin, Maxim Dounin <mdounin mdounin.ru>
Tested by: Vitalij Satanivskij <satan ukr.net>
Sponsored by: Netflix


# 0acf5d0b 25-Feb-2016 Mark Johnston <markj@FreeBSD.org>

Improve error handling for posix_fallocate(2) and posix_fadvise(2).

- Set td_errno so that ktrace and dtrace can obtain the syscall error
number in the usual way.
- Pass negative error numbers directly to the syscall layer, as they're
not intended to be returned to userland.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5425


# 7b445033 10-May-2015 Konstantin Belousov <kib@FreeBSD.org>

On exec, single-threading must be enforced before argument space is
allocated from exec_map. If many threads try to perform execve(2) in
parallel, the exec map is exhausted and some threads sleep
uninterruptibly waiting for the map space. Then the thread which won
the race for the space allocation cannot single-thread the process,
causing deadlock.

Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 76cd2549 04-May-2015 Peter Wemm <peter@FreeBSD.org>

Fix an error in r281551, part of the getfsstat() / kern_getfsstat()
rework. The number of entries was supposed to be returned to the user,
not used as a scratch variable.

This broke RELENG_4 jails starting up on current systems.


# 1c73bcab 15-Apr-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Rewrite linprocfs_domtab() as a wrapper around kern_getfsstat(). This
adds missing jail and MAC checks.

Differential Revision: https://reviews.freebsd.org/D2193
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 2205e0d1 23-Jan-2015 Jilles Tjoelker <jilles@FreeBSD.org>

Add futimens and utimensat system calls.

The core kernel part is patch file utimes.2008.4.diff from
pluknet@FreeBSD.org. I updated the code for API changes, added the manual
page and added compatibility code for old kernels. There is also audit and
Capsicum support.

A new UTIME_* constant might allow setting birthtimes in future.

Differential Revision: https://reviews.freebsd.org/D1426
Submitted by: pluknet (partially)
Reviewed by: delphij, pluknet, rwatson
Relnotes: yes


# 677258f7 18-Jan-2015 Konstantin Belousov <kib@FreeBSD.org>

Add procctl(2) PROC_TRACE_CTL command to enable or disable debugger
attachment to the process. Note that the command is not intended to
be a security measure, rather it is an obfuscation feature,
implemented for parity with other operating systems.

Discussed with: jilles, rwatson
Man page fixes by: rwatson
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# b53fc49c 15-Jan-2015 Konstantin Belousov <kib@FreeBSD.org>

fcntl F_O{GET,SET}LK take a pointer as the arg; handle them properly for
compat32.

Reported and tested by: Alex Tutubalin <lexa@lexa.ru>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 237623b0 14-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

Add a facility for a non-init process to declare itself the reaper of
its orphaned descendants. The base of the API is modelled after the same
feature in DragonFlyBSD.

Requested by: bapt
Reviewed by: jilles (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# 6e646651 13-Nov-2014 Konstantin Belousov <kib@FreeBSD.org>

Remove the no-at variants of the kern_xx() syscall helpers. E.g., we
have both kern_open() and kern_openat(); change the callers to use
kern_openat().

This removes one (sometimes two) levels of indirection and
consolidates arguments checks.

Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 186d9c34 12-Nov-2014 Dmitry Chagin <dchagin@FreeBSD.org>

Add the ppoll() system call.
Export kern_poll() needed by an upcoming Linuxulator change.

Differential Revision: https://reviews.freebsd.org/D1133
Reviewed by: kib, wblock
MFC after: 1 month


# efe28398 11-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Fix build.


# 0e87b36e 11-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove SF_KQUEUE code. This code was developed at Netflix, but was never
used. It didn't go into stable/10, nor was it documented.
It might be useful, but we collectively decided to remove it rather
than leave it abandoned and unmaintained. It is removed in one single
commit, so restoring it should be easy if anyone wants to reopen
this idea.

Sponsored by: Netflix


# 0a2c94b8 28-Oct-2014 Konstantin Belousov <kib@FreeBSD.org>

Replace some calls to fuword() by fueword() with proper error checking.

Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 3 weeks


# e015b1ab 26-Oct-2014 Mateusz Guzik <mjg@FreeBSD.org>

Avoid dynamic syscall overhead for statically compiled modules.

The kernel tracks syscall users so that modules can safely unregister them.

But if the module is not unloadable or was compiled into the kernel, there is
no need to do this.

Achieve this by adding SY_THR_STATIC_KLD macro which expands to SY_THR_STATIC
during kernel build and 0 otherwise.

Reviewed by: kib (previous version)
MFC after: 2 weeks


# f69261f2 25-Sep-2014 Konstantin Belousov <kib@FreeBSD.org>

Fix fcntl(2) compat32 after r270691. The copyin and copyout of
struct flock are done in sys_fcntl(), which means that compat32 used
direct access to userland pointers.

Move code from sys_fcntl() to a new wrapper, kern_fcntl_freebsd(), which
performs the necessary userland memory accesses, and use it from both
native and compat32 fcntl syscalls.

Reported by: jhibbits
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 8fbeebf5 26-Aug-2014 Konstantin Belousov <kib@FreeBSD.org>

Fix handling of the third argument for fcntl(2). The native syscall
uses long for arg, which needs translation.

Discussed with and tested by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# e7d939bd 06-Jul-2014 Marcel Moolenaar <marcel@FreeBSD.org>

Remove ia64.

This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation

This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h

Discussed at: BSDcan


# 0fa211be 05-Apr-2014 Marcel Moolenaar <marcel@FreeBSD.org>

In freebsd32_sendmsg(), replace the call to sockargs() followed by a
call to freebsd32_convert_msg_in() with freebsd32_copyin_control() to
read in and convert in a single step. This makes it simpler to put all
the control messages in a single mbuf or mbuf cluster as per the
limitations imposed upon us by ip6_setpktopts().

The logic is as follows:
1. Go over the array of control messages to determine overall size
and include extra padding for proper alignment as we go.
2. Get a mbuf or mbuf cluster as needed or fail if the overall
(adjusted) size is larger than a cluster.
3. Go over the array of control messages again, but now copy them
into kernel space and into aligned offsets.
4. Update the length of the control message to take padding between
the header and the data into account (but not for padding added
between one control message and the next).

Obtained from: Juniper Networks, Inc.
MFC after: 1 week
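
The sizing pass (steps 1 and 2) and the length fix-up (step 4) can be
sketched as below. This is only an illustration under stated assumptions:
the SKETCH_* alignment, header size and cluster capacity are made-up
constants, and the function names are hypothetical rather than the ones
used in freebsd32_copyin_control().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical alignment and limits; the kernel's values may differ. */
#define SKETCH_ALIGN(n)		(((n) + 7u) & ~(size_t)7u)
#define SKETCH_HDRLEN		12u	/* assumed 32-bit cmsghdr size */
#define SKETCH_CLUSTER		2048u	/* assumed cluster capacity */

/*
 * Steps 1 and 2: sum the space needed for every message, padding both
 * the header and the payload.  Returns 0 if the total would not fit in
 * one cluster, which stands in for the failure case.
 */
static size_t
control_space(const size_t *datalen, size_t n)
{
	size_t total = 0;

	for (size_t i = 0; i < n; i++) {
		total += SKETCH_ALIGN(SKETCH_HDRLEN) + SKETCH_ALIGN(datalen[i]);
		if (total > SKETCH_CLUSTER)
			return (0);
	}
	return (total);
}

/*
 * Step 4: the recorded message length covers the padding between header
 * and data, but not the padding that follows the data.
 */
static size_t
control_len(size_t datalen)
{
	return (SKETCH_ALIGN(SKETCH_HDRLEN) + datalen);
}
```

Step 3 would then walk the same array a second time and copy each message
to the offset computed from these padded sizes.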


# 8a27a339 30-Mar-2014 Warner Losh <imp@FreeBSD.org>

Remove instances of variables that were set, but never used. gcc 4.9
warns about these by default.


# 88b124ce 18-Mar-2014 Konstantin Belousov <kib@FreeBSD.org>

Make the array pointed to by AT_PAGESIZES auxv properly aligned.

Also, remove the expression which calculated the location of the
strings for a new image and had grown over time to become
incomprehensible. Instead, calculate the offsets in steps, which
also makes fixing the alignments much cleaner.

Reported and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 4a144410 16-Mar-2014 Robert Watson <rwatson@FreeBSD.org>

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after: 3 weeks


# 6f2b769c 15-Mar-2014 John-Mark Gurney <jmg@FreeBSD.org>

change td_retval into a union w/ off_t, with defines to mask the
change... This eliminates a cast, and also forces td_retval
(often 2 32-bit registers) to be aligned so that off_t's can be
stored there on arches with strict alignment requirements like
armeb (AVILA)... On i386, this doesn't change alignment, and on
amd64 it doesn't either, as register_t is already 64bits...

This will also prevent future breakage due to people adding additional
fields to the struct...

This gets AVILA booting a bit farther...

Reviewed by: bde


# 49d39308 30-Jan-2014 Konstantin Belousov <kib@FreeBSD.org>

posix_madvise(3) and posix_fadvise(2) should return an error on
failure, the same as posix_fallocate(2).

Noted by: Bob Bishop <rb@gid.co.uk>
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 2852de04 23-Jan-2014 Konstantin Belousov <kib@FreeBSD.org>

The posix_fallocate(2) syscall should return the error number on error,
without modifying errno.

Reported and tested by: Gennady Proskurin <gpr@mail.ru>
Reviewed by: mdf
PR: standards/186028
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
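
From the caller's side, this convention means the result is checked
through the return value rather than errno. A minimal userland sketch,
with reserve_space() being a hypothetical helper name:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Reserve len bytes from offset 0; returns 0 or an error number. */
static int
reserve_space(int fd, off_t len)
{
	int error;

	/* The error number is the return value; do not consult errno. */
	error = posix_fallocate(fd, 0, len);
	if (error != 0)
		fprintf(stderr, "posix_fallocate: %s\n", strerror(error));
	return (error);
}
```

Passing an invalid descriptor, for example, is expected to yield EBADF as
the return value.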


# 0cfea1c8 16-Jan-2014 Adrian Chadd <adrian@FreeBSD.org>

Implement a kqueue notification path for sendfile.

This fires off a kqueue note (of type sendfile) to the configured kqfd
when the sendfile transaction has completed and the relevant memory
backing the transaction is no longer in use by this transaction.
This is analogous to SF_SYNC waiting for the mbufs to complete -
except now you don't have to wait.

Both SF_SYNC and SF_KQUEUE should work together, even if it
doesn't necessarily make any practical sense.

This is designed for use by applications which use backing cache/store
files (eg Varnish) or POSIX shared memory (not sure anything is using
it yet!) to know when a region of memory is free for re-use. Note
it doesn't mark the region as free overall - only free from this
transaction. The application developer still needs to track which
ranges are in the process of being recycled and wait until all
pending transactions are completed.

TODO:

* documentation, as always

Sponsored by: Netflix, Inc.


# a43caef1 08-Jan-2014 Adrian Chadd <adrian@FreeBSD.org>

Refactor out the common sendfile code from the do_sendfile() and the
compat32 sendfile syscall.

Sponsored by: Netflix, Inc.


# 79750e3b 30-Nov-2013 Adrian Chadd <adrian@FreeBSD.org>

Migrate the sendfile_sync structure into a public(ish) API in preparation
for extending and reusing it.

The sendfile_sync wrapper is mostly just a "mbuf transaction" wrapper,
used to indicate that the backing store for a group of mbufs has completed.
It's only being used by sendfile for now and it's only implementing a
sleep/wakeup rendezvous. However, there are other potential signaling
paths (kqueue) and other potential uses (socket zero-copy write) where the
same mechanism would also be useful.

So, with that in mind:

* extract the sendfile_sync code out into sf_sync_*() methods
* teach the sf_sync_alloc method about the current config flag -
it will eventually know about kqueue.
* move the sendfile_sync code out of do_sendfile() - the only thing
it now knows about is the sfs pointer. The guts of the sync
rendezvous (setup, rendezvous/wait, free) is now done in the
syscall wrapper.
* .. and teach the 32-bit compat sendfile call the same.

This should be a no-op. It's primarily preparation work for teaching
the sendfile_sync about kqueue notification.

Tested:

* Peter Holm's sendfile stress / regression scripts

Sponsored by: Netflix, Inc.


# b5019bc4 28-Nov-2013 Peter Wemm <peter@FreeBSD.org>

jail_v0.ip_number was always in host byte order. This was handled
in one of the many layers of indirection and shims through stable/7
in jail_handle_ips(). When it was cleaned up and unified through
kern_jail() for 8.x, the byte order swap was lost.

This only matters for ancient binaries that call jail(2) themselves
internally.
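
The missing step amounts to a host-to-network conversion before the
number is used as an IPv4 address. A small sketch, assuming a
hypothetical helper name and not reproducing the kern_jail() code:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>

/* Convert a host-byte-order IPv4 number into a struct in_addr. */
static struct in_addr
host_order_to_in_addr(uint32_t ip_number)
{
	struct in_addr ia;

	ia.s_addr = htonl(ip_number);	/* s_addr is in network order */
	return (ia);
}
```

After the conversion the bytes of s_addr appear in address order
(for 0x7f000001: 127, 0, 0, 1) regardless of the host's endianness.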


# 7689abae 26-Nov-2013 Adrian Chadd <adrian@FreeBSD.org>

Fix the compat32 sendfile() to be in line with my recent changes.

Reminded by: kib


# 55648840 19-Sep-2013 John Baldwin <jhb@FreeBSD.org>

Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
from arbitrary processes. Similar to ktrace it can apply a change to all
existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes. MADV_PROTECT still works for backwards
compatibility.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.

Reviewed by: kib, jilles (earlier version)
Approved by: re (delphij)
MFC after: 1 month


# edb572a3 09-Sep-2013 John Baldwin <jhb@FreeBSD.org>

Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space. This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address. While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by: alc
Approved by: re (kib)


# 7008be5b 04-Sep-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define a new right, the CAPRIGHT() macro must be used. The macro takes
two arguments - an array index and a bit to set, e.g.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine a few rights, but the rights have to
belong to the same array element, e.g.:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belong to the same array element this way is
correct, but is not advised. It should only be used for alias definitions.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation
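
A minimal sketch of the index encoding described above. It assumes the
one-hot index occupies bits 57 to 61, which is inferred from the layout in
the text (two count bits plus five index bits leave 57 payload bits per
element, hence 2 x 57 = 114 rights now and 5 x 57 = 285 at most). The
SK_-prefixed macros and single_element() are hypothetical names, not the
definitions from the system headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the encoding: bits 57..61 hold a one-hot array index. */
#define SK_CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))
#define SK_IDXBITS(right)	(((right) >> 57) & 0x1fULL)

#define SK_CAP_LOOKUP	SK_CAPRIGHT(0, 0x0000000000000400ULL)
#define SK_CAP_FCHMOD	SK_CAPRIGHT(0, 0x0000000000002000ULL)
#define SK_CAP_PDKILL	SK_CAPRIGHT(1, 0x0000000000000800ULL)

/* A combined value is usable only if exactly one index bit is set. */
static bool
single_element(uint64_t right)
{
	uint64_t idx = SK_IDXBITS(right);

	return (idx != 0 && (idx & (idx - 1)) == 0);
}
```

OR-ing two rights from the same element keeps a single index bit set,
while mixing elements 0 and 1 sets two index bits, which is the condition
the assertion in the real functions can detect.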


# d6f6b876 18-Aug-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Move the PAIR32TO64() macro and the RETVAL_HI/RETVAL_LO defines to a
header file for use by other .c files.

Sponsored by: The FreeBSD Foundation


# ca04d21d 15-Aug-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by: kib
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 643ee871 21-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

Implement compat32 wrappers for the ktimer_* syscalls.

Reported, reviewed and tested by: Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 97319989 21-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

Move the convert_sigevent32() utility function into freebsd32_misc.c
for consumption outside the vfs_aio.c.

For SIGEV_THREAD_ID and SIGEV_SIGNAL notification delivery methods,
also copy in the sigev_value, since librt event pumping loop compares
note generation number with the value passed through sigev_value.

Tested by: Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# df4d0ed1 21-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

Cosmetic change, use the same union name on the left and right sides
of the conversion.

Tested by: Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# d31e4b3a 20-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

id_t is 64bit, provide the compat32 wrapper for clock_getcpuclockid2(2).

Reported and tested by: Petr Salinger <Petr.Salinger@seznam.cz>
PR: threads/180652
Sponsored by: The FreeBSD Foundation


# 1f4e9654 31-May-2013 David E. O'Brien <obrien@FreeBSD.org>

Add a "kern.features" MIB for 32bit support under a 64bit kernel.


# 48947ecc 21-May-2013 Konstantin Belousov <kib@FreeBSD.org>

Fix the wait6(2) on 32bit architectures and for the compat32, by using
the right type for the argument in syscalls.master. Also fix the
posix_fallocate(2) and posix_fadvise(2) compat32 syscalls on the
architectures which require padding of the 64bit argument.

Noted and reviewed by: jhb
Pointy hat to: kib
MFC after: 1 week


# a2a85596 15-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

Style fixes for r242958.

Reported and reviewed by: bde
MFC after: 28 days


# f13b5a0f 12-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

Add the wait6(2) system call. It takes a POSIX waitid()-like process
designator to select the process to wait for. The system call
optionally returns the siginfo_t which would otherwise be provided to
the SIGCHLD handler, as well as an extended structure accounting for
child and cumulative grandchild resource usage.

Also allow retrieving the current rusage information for non-exited
processes, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for the WNOWAIT option across the wait*(2)
family, by not removing the queued signal state.

PR: standards/170346
Submitted by: "Jukka A. Ukkonen" <jau@iki.fi>
MFC after: 1 month


# 76dcec5d 24-May-2012 Gleb Kurtsou <gleb@FreeBSD.org>

Add kern_fhstat(), adjust sys_fhstat() to use it.

Extend kern_getdirentries() to accept uio segflag and optionally return
buffer residue.

Sponsored by: Google Summer of Code 2011


# a6d20bba 03-Mar-2012 Juli Mallett <jmallett@FreeBSD.org>

On MIPS, _ALIGN always aligns to 8 bytes, even for 32-bit binaries. This might
not be ideal, but is the ABI we've shipped so far. Fix macros which reflect
the results of _ALIGN on 32-bit MIPS to use the right alignment.

This fixes sendmsg under COMPAT_FREEBSD32 on n64 MIPS kernels.


# 9624d947 03-Mar-2012 Juli Mallett <jmallett@FreeBSD.org>

o) Add COMPAT_FREEBSD32 support for MIPS kernels using the n64 ABI with userlands
using the o32 ABI. This mostly follows nwhitehorn's lead in implementing
COMPAT_FREEBSD32 on powerpc64.
o) Add a new type to the freebsd32 compat layer, time32_t, which is time_t in the
32-bit ABI being used. Since the MIPS port is relatively-new, even the 32-bit
ABIs use a 64-bit time_t.
o) Because time{spec,val}32 has the same size and layout as time{spec,val} on MIPS
with 32-bit compatibility, then, disable some code which assumes otherwise
wrongly when built for MIPS. A more general macro to check in this case would
seem like a good idea eventually. If someone adds support for using n32
userland with n64 kernels on MIPS, then they will have to add a variety of
flags related to each piece of the ABI that can vary. That's probably the
right time to generalize further.
o) Add MIPS to the list of architectures which use PAD64_REQUIRED in the
freebsd32 compat code. Probably this should be generalized at some point.

Reviewed by: gonzo


# cc672d35 16-Jan-2012 Kirk McKusick <mckusick@FreeBSD.org>

Make sure all intermediate variables holding mount flags (mnt_flag)
and that all internal kernel calls passing mount flags are declared
as uint64_t so that flags in the top 32-bits are not lost.

MFC after: 2 weeks


# 7edec621 14-Nov-2011 John Baldwin <jhb@FreeBSD.org>

- Split out a kern_posix_fadvise() from the posix_fadvise() system call so
it can be used by in-kernel consumers.
- Make kern_posix_fallocate() public.
- Use kern_posix_fadvise() and kern_posix_fallocate() to implement the
freebsd32 wrappers for the two system calls.


# 936c09ac 03-Nov-2011 John Baldwin <jhb@FreeBSD.org>

Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month


# 126b36a2 14-Oct-2011 Konstantin Belousov <kib@FreeBSD.org>

Control the execution permission of the readable segments for
i386 binaries on the amd64 and ia64 with the sysctl, instead of
unconditionally enabling it.

Reviewed by: marcel
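
As a rough illustration of the behaviour this entry describes, the sketch below adds execute permission to a read request only when a tunable allows it. The function name and the exec_on_read parameter are hypothetical stand-ins for the sysctl-backed setting; this is not the kernel code.

```c
#include <sys/mman.h>

/*
 * Hypothetical helper: exec_on_read stands in for the sysctl-controlled
 * setting.  When it is non-zero, a request for read permission also gets
 * execute permission; otherwise the protection bits pass through unchanged.
 */
static int
compat32_adjust_prot(int prot, int exec_on_read)
{
	if (exec_on_read && (prot & PROT_READ) != 0)
		prot |= PROT_EXEC;
	return (prot);
}
```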


# f9403424 14-Oct-2011 John Baldwin <jhb@FreeBSD.org>

Use PAIR32TO64() for the offset and length parameters to
freebsd32_posix_fallocate() to properly handle big-endian platforms.

Reviewed by: mdf
MFC after: 1 week
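
The endianness issue can be sketched as follows: a 64-bit argument arrives from a 32-bit process as two 32-bit slots, and which slot holds the high half depends on byte order. The helper below is a hypothetical illustration written for this log, not the PAIR32TO64() macro itself.

```c
#include <stdint.h>

/*
 * Hypothetical helper: word1 and word2 are the two 32-bit argument slots
 * in the order they were passed.  On a big-endian ABI the first slot is
 * assumed to carry the high half; on little-endian, the low half.
 */
static uint64_t
pair32_to_64(uint32_t word1, uint32_t word2, int big_endian)
{
	uint64_t hi = big_endian ? word1 : word2;
	uint64_t lo = big_endian ? word2 : word1;

	return ((hi << 32) | lo);
}
```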


# 90fd594a 13-Oct-2011 Marcel Moolenaar <marcel@FreeBSD.org>

Use PTRIN().


# f8244106 13-Oct-2011 Marcel Moolenaar <marcel@FreeBSD.org>

Wrap mprotect(2) so that we can add execute permissions when read
permissions are requested. This is needed on amd64 and ia64 for
JDK 1.4.x


# 488a1605 13-Oct-2011 Marcel Moolenaar <marcel@FreeBSD.org>

In freebsd32_mmap() and when compiling for amd64 or ia64, also
ask for execute permissions when read permissions are wanted.
This is needed for JDK 1.4.x on i386.


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 71812178 16-Jun-2011 Konstantin Belousov <kib@FreeBSD.org>

Implement compat32 for old lseek, for the a.out binaries on amd64.


# d91f88f7 18-Apr-2011 Matthew D Fleming <mdf@FreeBSD.org>

Add the posix_fallocate(2) syscall. The default implementation in
vop_stdallocate() is filesystem agnostic and will run as slow as a
read/write loop in userspace; however, it serves to correctly
implement the functionality for filesystems that do not implement a
VOP_ALLOCATE.

Note that __FreeBSD_version was already bumped today to 900036 for any
ports which would like to use this function.

Also reserve space in the syscall table for posix_fadvise(2).

Reviewed by: -arch (previous version)


# 7332c129 01-Apr-2011 Konstantin Belousov <kib@FreeBSD.org>

Add support for executing the FreeBSD 1/i386 a.out binaries on amd64.

In particular:
- implement compat shims for old stat(2) variants and ogetdirentries(2);
- implement delivery of signals with ancient stack frame layout and
corresponding sigreturn(2);
- implement old getpagesize(2);
- provide a user-mode trampoline and LDT call gate for lcall $7,$0;
- port a.out image activator and connect it to the build as a module
on amd64.

The changes are hidden under COMPAT_43.

MFC after: 1 month


# 86665509 30-Mar-2011 Konstantin Belousov <kib@FreeBSD.org>

Provide compat32 shims for kldstat(2).

Requested and tested by: jpaetzel
MFC after: 1 week


# 6297a3d8 08-Jan-2011 Konstantin Belousov <kib@FreeBSD.org>

Create shared (readonly) page. Each ABI may specify the use of page by
setting SV_SHP flag and providing pointer to the vm object and mapping
address. Provide simple allocator to carve space in the page, tailored
to put the code with alignment restrictions.

Enable shared page use for amd64, both native and 32bit FreeBSD
binaries. Page is private mapped at the top of the user address
space, moving a start of the stack one page down. Move signal
trampoline code from the top of the stack to the shared page.

Reviewed by: alc


# f03749ca 23-Nov-2010 Sergey Kandaurov <pluknet@FreeBSD.org>

Update MNT_ROOTFS comments after changes in the root mount logic.

Reported by: arundel
Suggested by: marcel (phrasing)
Approved by: kib (mentor)


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# ee235bef 17-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

Supply some useful information to the started image using ELF aux vectors.
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.

Tested by: marius (sparc64)
MFC after: 1 month


# 1757d969 07-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

Prefer struct sysentvec sv_psstrings to hardcoding FREEBSD32_PS_STRINGS
in the compat32 code. Use sv_usrstack instead of FREEBSD32_USRSTACK as well.

MFC after: 1 week


# 64dc04de 04-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

Copy inode birthtime to the struct stat32.

MFC after: 1 week


# 45b6fa3b 04-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

Fix style.

MFC after: 1 week


# 34ab36a3 03-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

When compat32 recvmsg(2) does not need to copy out control messages, set
msg_controllen to 0.

PR: kern/149227
Submitted by: Stef Walter <stef memberwebs com>
MFC after: 1 week


# 2af6e14d 27-Jul-2010 Alan Cox <alc@FreeBSD.org>

Introduce exec_alloc_args(). The objective being to encapsulate the
details of the string buffer allocation in one place.

Eliminate the portion of the string buffer that was dedicated to storing
the interpreter name. The pointer to the interpreter name can simply be
made to point to the appropriate argument string.

Reviewed by: kib


# 9e4e5114 25-Jul-2010 Alan Cox <alc@FreeBSD.org>

Change the order in which the file name, arguments, environment, and
shell command are stored in exec*()'s demand-paged string buffer. For
a "buildworld" on an 8GB amd64 multiprocessor, the new order reduces
the number of global TLB shootdowns by 31%. It also eliminates about
330k page faults on the kernel address space.

Change exec_shell_imgact() to use "args->begin_argv" consistently as
the start of the argument and environment strings. Previously, it
would sometimes use "args->buf", which is the start of the overall
buffer, but no longer the start of the argument and environment
strings. While I'm here, eliminate unnecessary passing of "&length"
to copystr(), where we don't actually care about the length of the
copied string.

Clean up the initialization of the exec map. In particular, use the
correct size for an entry, and express that size in the same way that
is used when an entry is allocated. The old size was one page too
large. (This discrepancy originated in 2004 when I rewrote
exec_map_first_page() to use sf_buf_alloc() instead of the exec map
for mapping the first page of the executable.)

Reviewed by: kib


# 0b53d156 23-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Remove linux_exec_copyin_args(); freebsd32_exec_copyin_args() may
serve as well. COMPAT_FREEBSD32 is a prerequisite for COMPAT_LINUX32.

Reviewed by: alc
MFC after: 3 weeks


# 69a8f9e3 23-Jul-2010 Alan Cox <alc@FreeBSD.org>

Eliminate a little bit of duplicated code.


# 67322a4c 04-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Constify source argument for siginfo_to_siginfo32().

MFC after: 1 week


# 0c740f70 28-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r207007:
Extract the code to copy-out struct rusage32 from struct rusage
into the new function.


# 9847e91b 21-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

Extract the code to copy-out struct rusage32 from struct rusage
into the new function.

Reviewed by: jhb
MFC after: 1 week


# 7f67bb1b 07-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r205327:
Remove empty line.


# db5805dd 07-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r205323:
Move SysV IPC freebsd32 compat shims from freebsd32_misc.c to corresponding
sysv_{msg,sem,shm}.c files.

Mark SysV IPC freebsd32 syscalls as NOSTD and add required
SYSCALL_INIT_HELPER/SYSCALL32_INIT_HELPERs to provide auto
register/unregister on module load.

This makes COMPAT_FREEBSD32 functional with SysV IPC compiled and loaded
as modules.


# 4ad2ce35 07-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r205322:
Move SysV IPC freebsd32 compat shims helpers from freebsd32_misc.c to
sysv_ipc.c.


# 0272ddd8 07-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r205321:
Introduce SYSCALL_INIT_HELPER and SYSCALL32_INIT_HELPER macros and
necessary support functions to allow registering dynamically loaded
syscalls from the MOD_LOAD handlers. Helpers handle registration
failures semi-automatically.


# 6283dea5 07-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r205319:
Make freebsd32_copyiniov() available outside of freebsd32_misc.


# 4ccf64eb 06-Apr-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

MFC r205014,205015:

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

This MFC is required for MFCs of later changes to the freebsd32
compatibility from HEAD.

Requested by: kib


# 510ea843 28-Mar-2010 Ed Schouten <ed@FreeBSD.org>

Rename st_*timespec fields to st_*tim for POSIX 2008 compliance.

A nice thing about POSIX 2008 is that it finally standardizes a way to
obtain file access/modification/change times in sub-second precision,
namely using struct timespec, which we already have for a very long
time. Unfortunately POSIX uses different names.

This commit adds compatibility macros, so existing code should still
build properly. Also change all source code in the kernel to work
without any of the compatibility macros. This makes it all less
ambiguous.

I am also renaming st_birthtime to st_birthtim, even though it was a
local extension anyway. It seems Cygwin also has a st_birthtim.


# f7ae46da 19-Mar-2010 Konstantin Belousov <kib@FreeBSD.org>

Remove empty line.

MFC after: 2 weeks


# 75d633cb 19-Mar-2010 Konstantin Belousov <kib@FreeBSD.org>

Move SysV IPC freebsd32 compat shims from freebsd32_misc.c to corresponding
sysv_{msg,sem,shm}.c files.

Mark SysV IPC freebsd32 syscalls as NOSTD and add required
SYSCALL_INIT_HELPER/SYSCALL32_INIT_HELPERs to provide auto
register/unregister on module load.

This makes COMPAT_FREEBSD32 functional with SysV IPC compiled and loaded
as modules.

Reviewed by: jhb
MFC after: 2 weeks


# 4cfc39cf 19-Mar-2010 Konstantin Belousov <kib@FreeBSD.org>

Move SysV IPC freebsd32 compat shims helpers from freebsd32_misc.c to
sysv_ipc.c.

Reviewed by: jhb
MFC after: 2 weeks


# 0687ba3e 19-Mar-2010 Konstantin Belousov <kib@FreeBSD.org>

Introduce SYSCALL_INIT_HELPER and SYSCALL32_INIT_HELPER macros and
necessary support functions to allow registering dynamically loaded
syscalls from the MOD_LOAD handlers. Helpers handle registration
failures semi-automatically.

Reviewed by: jhb
MFC after: 2 weeks


# c5e4763d 19-Mar-2010 Konstantin Belousov <kib@FreeBSD.org>

Make freebsd32_copyiniov() available outside of freebsd32_misc.

MFC after: 2 weeks


# 841c0c7e 11-Mar-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

Reviewed by: kib, jhb


# 7e767511 19-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r198508, r198509:
Reimplement pselect() in kernel, making change of sigmask and sleep atomic.

MFC r198538:
Move pselect(3) man page to section 2.


# 43ba7803 19-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r198507:
Use kern_sigprocmask() instead of direct manipulation of td_sigmask to
reschedule newly blocked signals.

MFC r198590:
Trapsignal() calls kern_sigprocmask() when delivering caught signal
with proc lock held.

MFC r198670:
For trapsignal() and postsig(), kern_sigprocmask() is called with
both process lock and curproc->p_sigacts->ps_mtx locked. Prevent lock
recursion on ps_mtx in reschedule_signals().


# 3134e115 19-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r198506:
In kern_sigsuspend(), manipulate thread signal mask using
kern_sigprocmask(). Also, do cursig/postsig loop immediately after
waiting for signal, repeating the wait if wakeup was spurious due to
race with other thread fetching signal from the process queue before us.

MFC r199136:
Use cpu_set_syscall_retval(9) to set syscall result, and return
EJUSTRETURN from kern_sigsuspend() to prevent syscall return code from
modifying wrong frame.
Take care of possibility that pending SIGCONT might be cancelled by
SIGSTOP, causing postsig() not to deliver any caught signal.


# 066d836b 27-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

Current pselect(3) is implemented in usermode and is thus vulnerable to
the well-known race condition whose elimination was the reason for the
function's appearance in the first place. If the sigmask supplied as an
argument to pselect() enables a signal, the signal might be delivered
before the thread calls select(2), causing a lost wakeup. Reimplement
pselect() in the kernel, making the change of sigmask and the sleep atomic.

Since signal shall be delivered to the usermode, but sigmask restored,
set TDP_OLDMASK and save old mask in td_oldsigmask. The TDP_OLDMASK
should be cleared by ast() in case signal was not delivered during
syscall execution.

Reviewed by: davidxu
Tested by: pho
MFC after: 1 month


# d6e029ad 27-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

In r197963, a race with thread being selected for signal delivery
while in kernel mode, and later changing signal mask to block the
signal, was fixed for sigprocmask(2) and pthread_exit(3). The same race
exists for sigreturn(2), setcontext(2) and swapcontext(2) syscalls.

Use kern_sigprocmask() instead of direct manipulation of td_sigmask to
reschedule newly blocked signals, closing the race.

Reviewed by: davidxu
Tested by: pho
MFC after: 1 month


# 84440afb 27-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

In kern_sigsuspend(), better manipulate thread signal mask using
kern_sigprocmask() to properly notify other possible candidate threads
for signal delivery.

Since sigsuspend() shall only return to usermode after a signal was
delivered, do cursig/postsig loop immediately after waiting for
signal, repeating the wait if wakeup was spurious due to race with
other thread fetching signal from the process queue before us. Add
thread_suspend_check() call to allow the thread to be stopped or killed
while in loop.

Modify last argument of kern_sigprocmask() from boolean to flags,
allowing the function to be called with locked proc. Conversion of the
callers that supplied 1 to the old argument will be done in the next
commit, and due to SIGPROCMASK_OLD value equal to 1, code is formally
correct in between.

Reviewed by: davidxu
Tested by: pho
MFC after: 1 month


# 9f1fab50 16-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r197049:
Calculate the amount of bytes to copy for select filedescriptor masks
taking into account size of fd_set for the current process ABI.

Approved by: re (kensmith)


# b55ef216 09-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

kern_select(9) copies fd_set in and out of userspace in quantities of
longs. Since for 32bit processes longs are 4 bytes, a 64bit kernel may
copy in or out 4 bytes more than the process expected.

Calculate the amount of bytes to copy taking into account size of fd_set
for the current process ABI.

Diagnosed and tested by: Peter Jeremy <peterjeremy acm org>
Reviewed by: jhb
MFC after: 1 week
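
The size calculation at issue can be sketched as below: the number of bytes to copy is the descriptor count rounded up to whole words, where the word size is that of the process ABI rather than of the kernel. The function name and parameters are hypothetical and only illustrate the arithmetic.

```c
#include <stddef.h>

/*
 * Hypothetical helper: bytes needed to hold nd descriptor bits, rounded up
 * to whole words of word_bytes bytes each (4 for a 32-bit ABI, 8 for a
 * 64-bit one).
 */
static size_t
fdset_copy_bytes(int nd, size_t word_bytes)
{
	size_t bits_per_word = word_bytes * 8;

	return ((((size_t)nd + bits_per_word - 1) / bits_per_word) *
	    word_bytes);
}
```

With 32 descriptors, the 4-byte word size gives 4 bytes while the 8-byte word size gives 8, which is the 4-byte overrun the entry describes.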


# 8e3764c0 27-Jul-2009 John Baldwin <jhb@FreeBSD.org>

Fix the freebsd32 versions of semsys(), shmsys(), and msgsys() to use the
old ABI versions of the relevant control system call (e.g.
freebsd7_freebsd32_msgctl() instead of freebsd32_msgctl() for msgsys()).

Approved by: re (kib)


# 14961ba7 27-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Replace AUDIT_ARG() with variable argument macros with a set of more
specific macros for each audit argument type. This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).

In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.

Suggested by: brooks
Approved by: re (kib)
Obtained from: TrustedBSD Project
MFC after: 1 week


# b648d480 24-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Change the ABI of some of the structures used by the SYSV IPC API:
- The uid/cuid members of struct ipc_perm are now uid_t instead of unsigned
short.
- The gid/cgid members of struct ipc_perm are now gid_t instead of unsigned
short.
- The mode member of struct ipc_perm is now mode_t instead of unsigned short
(this is merely a style bug).
- The rather dubious padding fields for ABI compat with SV/I386 have been
removed from struct msqid_ds and struct semid_ds.
- The shm_segsz member of struct shmid_ds is now a size_t instead of an
int. This removes the need for the shm_bsegsz member in struct
shmid_kernel and should allow for complete support of SYSV SHM regions
>= 2GB.
- The shm_nattch member of struct shmid_ds is now an int instead of a
short.
- The shm_internal member of struct shmid_ds is now gone. The internal
VM object pointer for SHM regions has been moved into struct
shmid_kernel.
- The existing __semctl(), msgctl(), and shmctl() system call entries are
now marked COMPAT7 and new versions of those system calls which support
the new ABI are now present.
- The new system calls are assigned to the FBSD-1.1 version in libc. The
FBSD-1.0 symbols in libc now refer to the old COMPAT7 system calls.
- A simplistic framework for tagging system calls with compatibility
symbol versions has been added to libc. Version tags are added to
system calls by adding an appropriate __sym_compat() entry to
src/lib/libc/include/compat.h. [1]

PR: kern/16195 kern/113218 bin/129855
Reviewed by: arch@, rwatson
Discussed with: kan, kib [1]


# 0304c731 27-May-2009 Jamie Gritton <jamie@FreeBSD.org>

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# b38ff370 29-Apr-2009 Jamie Gritton <jamie@FreeBSD.org>

Introduce the extensible jail framework, using the same "name=value"
interface as nmount(2). Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
This replaces jail(2).
* jail_get, to read the parameters of existing jails. This replaces the
security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes. The current jail(2) system
call still exists, though it is now a stub to jail_set(2).

Approved by: bz (mentor)


# 8571af59 27-Mar-2009 Jamie Gritton <jamie@FreeBSD.org>

Whitespace/spelling fixes in advance of upcoming functional changes.

Approved by: bz (mentor)


# f86bce5e 02-Mar-2009 Jamie Gritton <jamie@FreeBSD.org>

Extend the "vfsopt" mount options for more general use. Make struct
vfsopt and the vfs_buildopts function public, and add some new fields
to struct vfsopt (pos and seen), and new functions vfs_getopt_pos and
vfs_opterror.

Further extend the interface to allow reading options from the kernel
in addition to sending them to the kernel, with vfs_setopt and related
functions.

While this allows the "name=value" option interface to be used for more
than just FS mounts (planned use is for jails), it retains the current
"vfsopt" name and <sys/mount.h> requirement.

Approved by: bz (mentor)


# ddf9d243 28-Dec-2008 Ed Schouten <ed@FreeBSD.org>

Push down Giant inside sysctl. Also add some more assertions to the code.

In the existing code we didn't really enforce that callers hold Giant
before calling userland_sysctl(), even though there is no guarantee it
is safe. Fix this by just placing Giant locks around the call to the oid
handler. This also means we only pick up Giant for a very short period
of time. Maybe we should add MPSAFE flags to sysctl or phase it out
altogether.

I've also added SYSCTL_LOCK_ASSERT(). We have to make sure sysctl_root()
and name2oid() are called with the sysctl lock held.

Reviewed by: Jille Timmermans <jille quis cx>


# 3cdf485f 03-Dec-2008 John Baldwin <jhb@FreeBSD.org>

When unloading a 32-bit system call module, restore the sysent vector in
the 32-bit system call table instead of the main system call table.


# 413628a7 29-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the aforementioned and in-kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible


# dc5aaa84 10-Nov-2008 Peter Wemm <peter@FreeBSD.org>

Sigh. Fix a pointer/int compile error.


# a22600a1 10-Nov-2008 Peter Wemm <peter@FreeBSD.org>

Fix a signal emulation bug introduced in r163018 (and present in 7.x).
This prevents 32 bit signal handlers from finding out what the faulting
address is. Both the secret 4th argument and siginfo->si_addr are zero.


# 63f8fe9e 22-Oct-2008 John Baldwin <jhb@FreeBSD.org>

Split the copyout of *base at the end of getdirentries() out leaving the
rest in kern_getdirentries(). Use kern_getdirentries() to implement
freebsd32_getdirentries(). This fixes a bug where calls to getdirentries()
in 32-bit binaries would trash the 4 bytes after the 'long base' in
userland.

Submitted by: ups
MFC after: 1 week


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 88ac915a 25-Sep-2008 John Baldwin <jhb@FreeBSD.org>

Add support for installing 32-bit system calls from kernel modules. This
includes syscall32_{de,}register() routines as well as a module handler
and wrapper macros similar to the support for native syscalls in
<sys/sysent.h>.

MFC after: 1 month


# 6e6049e9 19-Sep-2008 David E. O'Brien <obrien@FreeBSD.org>

Add freebsd32 compat shim for nmount(2).
(and quiet some compiler warnings for vfs_donmount)


# 109ea24c 15-Sep-2008 David E. O'Brien <obrien@FreeBSD.org>

style(9)


# e44f0b2a 10-Jul-2008 Brooks Davis <brooks@FreeBSD.org>

style(9): put parentheses around return values.


# a8c6d6d0 10-Jul-2008 Brooks Davis <brooks@FreeBSD.org>

id_t is a 64-bit integer and thus is passed as two arguments like off_t is.
As a result, those arguments must be recombined before calling the real
syscall implementation. This change fixes 32-bit compatibility for
cpuset_getid(), cpuset_setid(), cpuset_getaffinity(), and
cpuset_setaffinity().


# 4f1e7213 30-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Add the freebsd32 compatibility shims for the *at() syscalls.

Reviewed by: rwatson, rdivacky
Tested by: pho


# 0a635741 10-Dec-2007 John Baldwin <jhb@FreeBSD.org>

Bah, remove last vestiges of some statfs conversion fixes that aren't quite
ready for CVS yet that snuck into 1.68.

Pointy hat to: jhb


# d637500d 07-Dec-2007 Scott Long <scottl@FreeBSD.org>

Grrr, remove an unused variable missed in the last commit.


# 7815c9e2 07-Dec-2007 Scott Long <scottl@FreeBSD.org>

Don't expect a return value from statfs_scale_blocks().


# 3c39e0d8 06-Dec-2007 John Baldwin <jhb@FreeBSD.org>

Add freebsd32 compat wrappers for msgctl() and _semctl() using
kern_msgctl() and kern_semctl().

MFC after: 1 week


# d43c6fa4 06-Dec-2007 John Baldwin <jhb@FreeBSD.org>

Move 32-bit SYSV IPC structure definitions into freebsd32_ipc.h.

MFC after: 1 week


# 74427aa4 06-Dec-2007 John Baldwin <jhb@FreeBSD.org>

Move several data structure definitions out of freebsd32_misc.c and into
freebsd32.h instead.

MFC after: 1 week


# 959a913b 04-Dec-2007 Jung-uk Kim <jkim@FreeBSD.org>

Remove redundant checks for msgsnd(3) and msgrcv(3).
COMPAT_IA32 (implicitly) requires SYSVSEM, SYSVSHM and SYSVMSG in kernel.

Pointed out by: jhb


# cc479dda 28-Aug-2007 John Baldwin <jhb@FreeBSD.org>

Rework the routines to convert a 5.x+ statfs structure (with fixed-size
64-bit counters) to a 4.x statfs structure (with long-sized counters).
- For block counters, we scale up the block size sufficiently large so
that the resulting block counts fit into the long-sized (long for the
ABI, so 32-bit in freebsd32) counters. In 4.x the NFS client's statfs
VOP did this already. This can lie about the block size to 4.x binaries,
but it presents a more accurate picture of the ratios of free and
available space.
- For non-block counters, fix the freebsd32 stats converter to cap the
values at INT32_MAX rather than losing the upper 32-bits to match the
behavior of the 4.x statfs conversion routine in vfs_syscalls.c

Approved by: re (kensmith)
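
A minimal sketch of the two conversions described above, assuming a single block counter for simplicity: the block size is doubled until the count fits in a 32-bit signed value, and non-block counters are simply capped. The names scale_blocks and cap_count are hypothetical and are not the kernel routines.

```c
#include <stdint.h>

/* Hypothetical result type: a block size and a block count. */
struct scaled_blocks {
	uint64_t bsize;
	uint64_t blocks;
};

/*
 * Double the block size and halve the count until the count fits in a
 * 32-bit signed long.  The reported block size may no longer be the real
 * one, but the total size stays approximately the same.
 */
static struct scaled_blocks
scale_blocks(uint64_t bsize, uint64_t blocks)
{
	struct scaled_blocks s = { bsize, blocks };

	while (s.blocks > INT32_MAX) {
		s.blocks >>= 1;
		s.bsize <<= 1;
	}
	return (s);
}

/* Cap a non-block counter at INT32_MAX instead of truncating it. */
static int32_t
cap_count(uint64_t v)
{
	return (v > INT32_MAX ? INT32_MAX : (int32_t)v);
}
```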


# 5aa69f9c 04-Jul-2007 Peter Wemm <peter@FreeBSD.org>

Add compat6 wrapper code for mmap/lseek/pread/pwrite/truncate/ftruncate.

Approved by: re (kensmith)


# 739c673c 16-Jun-2007 Matt Jacob <mjacob@FreeBSD.org>

Try a cheap way to get around gcc4.2 believing that user arguments
to system calls can change across intervening functions.


# 302e130e 23-May-2007 Olivier Houchard <cognet@FreeBSD.org>

Remove duplicate includes.

Submitted by: Cyril Nguyen Huu <cyril ci0 org>


# 37f3c893 01-May-2007 Alan Cox <alc@FreeBSD.org>

Eliminate the use of Giant from ia64-specific code in freebsd32_mmap().


# 127891ca 20-Dec-2006 Jung-uk Kim <jkim@FreeBSD.org>

MFP4: (part of) 110058

Fix 32-bit msgsnd(3) and msgrcv(3) emulations for amd64.


# c6511aea 04-Oct-2006 David Xu <davidxu@FreeBSD.org>

Move some declarations of 32-bit signal structures into the file
freebsd32-signal.h, and implement the sigtimedwait and sigwaitinfo system calls.


# f645b0b5 01-Oct-2006 Poul-Henning Kamp <phk@FreeBSD.org>

First part of a little cleanup in the calendar/timezone/RTC handling.

Move relevant variables to <sys/clock.h> and fix #includes as necessary.

Use libkern's much more time- & space-efficient BCD routines.


# cda9a0d1 22-Sep-2006 David Xu <davidxu@FreeBSD.org>

Add compatibility code to let 32bit libthr work on a 64bit kernel.


# a88d050d 15-Aug-2006 Jung-uk Kim <jkim@FreeBSD.org>

Include sys/limits.h for INT_MAX. freebsd32_proto.h 1.58 does not include
sys/umtx.h any more and previously it was included from there.


# c870740e 10-Jul-2006 John Baldwin <jhb@FreeBSD.org>

- Split out kern_accept(), kern_getpeername(), and kern_getsockname() for
use by ABI emulators.
- Alter the interface of kern_recvit() somewhat. Specifically, go ahead
and hard code UIO_USERSPACE in the uio as that's what all the callers
specify. In its place, add a new uioseg to indicate what type of pointer
is in mp->msg_name. Previously it was always a userland address, but
ABI emulators may pass in kernel-side sockaddrs. Also, remove the
namelenp field and instead require the two places that used it to
explicitly copy mp->msg_namelen out to userland.
- Use the patched kern_recvit() to replace svr4_recvit() and the stock
kern_sendit() to replace svr4_sendit().
- Use kern_bind() instead of stackgap use in ti_bind().
- Use kern_getpeername() and kern_getsockname() instead of stackgap in
svr4_stream_ti_ioctl().
- Use kern_connect() instead of stackgap in svr4_do_putmsg().
- Use kern_getpeername() and kern_accept() instead of stackgap in
svr4_do_getmsg().
- Retire the stackgap from SVR4 compat as it is no longer used.


# acdd09f9 10-Jul-2006 John Baldwin <jhb@FreeBSD.org>

Unexpand PTRIN() in several places and fix one instance where 0 was being
used instead of NULL.


# e8b62ee7 08-Jun-2006 Paul Saab <ps@FreeBSD.org>

Do not copy out the iovec in the 32bit recvmsg call since soreceive
calls uiomove directly.

Reviewed by: ups
MFC after: 1 week


# fbb273bc 30-Mar-2006 Paul Saab <ps@FreeBSD.org>

Properly support FreeBSD 4 32bit System V shared memory.

Submitted by: peter
Obtained from: Yahoo!
MFC after: 3 weeks


# 68ff3c24 08-Mar-2006 Stephan Uphoff <ups@FreeBSD.org>

Fix exec_map resource leaks.

Tested by: kris@


# 6308f39d 03-Mar-2006 Paul Saab <ps@FreeBSD.org>

Use strlcpy in cvtstatfs and copy_statfs instead of bcopy to ensure
the copied strings are properly terminated.

bzero the statfs32 struct in copy_statfs.


# fa545f43 28-Feb-2006 Paul Saab <ps@FreeBSD.org>

Fix 32bit sendfile by implementing kern_sendfile so that it takes
the header and trailers as iovec arguments instead of copying them
in inside of sendfile.

Reviewed by: jhb
MFC after: 3 weeks


# 8917b8d2 06-Feb-2006 John Baldwin <jhb@FreeBSD.org>

- Always call exec_free_args() in kern_execve() instead of doing it in all
the callers if the exec either succeeds or fails early.
- Move the code to call exit1() if the exec fails after the vmspace is
gone to the bottom of kern_execve() to cut down on some code duplication.


# 08a3081d 20-Jan-2006 Doug Ambrisko <ambrisko@FreeBSD.org>

Add 32bit version of lutimes so untar doesn't mess up sym-links on amd64.


# 8e7604db 08-Dec-2005 Doug Ambrisko <ambrisko@FreeBSD.org>

Add 32bit version of futimes so untar doesn't result in bad dates
(Jan 1, 1970) when run on amd64.

Reviewed by: ps


# 506df56c 06-Nov-2005 Paul Saab <ps@FreeBSD.org>

Copy out the number of iovecs in freebsd32_recvmsg, not the length
of a single iovec.


# ecc44de7 31-Oct-2005 Paul Saab <ps@FreeBSD.org>

Reformat socket control messages on input/output for 32bit compatibility
on 64bit systems.

Submitted by: ps, ups
Reviewed by: jhb


# 767dfc44 26-Oct-2005 Peter Wemm <peter@FreeBSD.org>

There is no 'freebsd3_' prefix for COMPAT_43 syscalls. Those are all
bundled under MCOMPAT and have an 'o' prefix. Adjust as appropriate.
This re-enables compiling without COMPAT_43 again.


# e7abd4a0 23-Oct-2005 Paul Saab <ps@FreeBSD.org>

Implement for FreeBSD 3 32bit binaries:
sigaction, sigprocmask, sigpending, sigvec, sigblock, sigsetmask,
sigsuspend, sigstack


# a372f822 14-Oct-2005 Paul Saab <ps@FreeBSD.org>

Implement the 32bit versions of recvmsg, recvfrom, sendmsg

Partially obtained from: jhb


# f0b479cd 14-Oct-2005 Paul Saab <ps@FreeBSD.org>

Implement 32bit wrappers for clock_gettime, clock_settime, and
clock_getres.


# d5c77961 14-Oct-2005 Paul Saab <ps@FreeBSD.org>

Correct the prototype for freebsd32_nanosleep and use the proper
size when copying struct timespec32 in and out.


# f2107e8d 03-Oct-2005 John Baldwin <jhb@FreeBSD.org>

Use the constants for the syscall names from syscall.h rather than
hardcoding the numbers for the SYSVIPC syscalls.


# fa34d9b7 13-Jul-2005 John Baldwin <jhb@FreeBSD.org>

Wrap the ia64-specific freebsd32_mmap_partial() hack in Giant for now
since it calls into VFS and VM. This makes the freebsd32_mmap() routine
MP safe and the extra Giants here can be revisited later.

Glanced at by: marcel
MFC after: 3 days


# bcd9e0dd 07-Jul-2005 John Baldwin <jhb@FreeBSD.org>

- Add two new system calls: preadv() and pwritev() which are like readv()
and writev() except that they take an additional offset argument and do
not change the current file position. In SAT speak:
preadv:readv::pread:read and pwritev:writev::pwrite:write.
- Try to reduce code duplication some by merging most of the old
kern_foov() and dofilefoo() functions into new dofilefoo() functions
that are called by kern_foov() and kern_pfoov(). The non-v functions
now all generate a simple uio on the stack from the passed in arguments
and then call kern_foov(). For example, read() now just builds a uio and
calls kern_readv() and pwrite() just builds a uio and calls kern_pwritev().

PR: kern/80362
Submitted by: Marc Olzheim marcolz at stack dot nl (1)
Approved by: re (scottl)
MFC after: 1 week
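
The relationship described above, where a non-vector call builds a single-element I/O vector and hands it to the vector variant, can be illustrated from userland. This is only a sketch of the idea, not the kernel code: `pread_via_preadv()` is a hypothetical wrapper name, and the kernel builds a struct uio rather than calling preadv() itself.

```c
#define _DEFAULT_SOURCE		/* expose preadv() on glibc */
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical wrapper: a plain positional read built on preadv(). */
static ssize_t
pread_via_preadv(int fd, void *buf, size_t nbyte, off_t offset)
{
	struct iovec iov;

	/* Describe the caller's buffer as a one-element vector. */
	iov.iov_base = buf;
	iov.iov_len = nbyte;

	/* Read at the given offset; the file position is left unchanged. */
	return (preadv(fd, &iov, 1, offset));
}
```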


# 19042f9c 29-Jun-2005 John Baldwin <jhb@FreeBSD.org>

- Change the commented out freebsd32_xxx() example to use kern_xxx() along
with a single copyin() + translate and translate + copyout() rather than
using the stackgap.
- Remove implementation of the stackgap for freebsd32 since it is no longer
used for that compat ABI.

Approved by: re (scottl)


# de1c01ad 24-Jun-2005 John Baldwin <jhb@FreeBSD.org>

Correct the amount of data to allocate in these local copies of
exec_copyin_strings() to catch up to rev 1.266 of kern_exec.c. This fixes
panics on amd64 with compat binaries since exec_free_args() was freeing
more memory than these functions were allocating and the mismatch could
cause memory to be freed out from under other concurrent execs.

Approved by: re (scottl)


# 3a996d6e 11-Jun-2005 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Do not allocate memory based on an unchecked argument from userland.
It can be used to panic the kernel by passing too large a value.
Fix it by moving allocation and size verification into kern_getfsstat().
This even simplifies kern_getfsstat() consumers, but destroys symmetry -
memory is allocated inside kern_getfsstat(), but has to be freed by the
caller.

Found by: FreeBSD Kernel Stress Test Suite: http://www.holm.cc/stress/
Reported by: Peter Holm <peter@holm.cc>


# 13a82b96 09-Jun-2005 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Avoid code duplication in several places by introducing a universal
kern_getfsstat() function.

Obtained from: jhb


# efe5beca 03-Jun-2005 Paul Saab <ps@FreeBSD.org>

Wrap copyin/copyout for kevent so the 32bit wrapper does not have
to malloc nchanges * sizeof(struct kevent) AND/OR nevents *
sizeof(struct kevent) on every syscall.

Glanced at by: peter, jmg
Obtained from: Yahoo!
MFC after: 2 weeks


# 473dd55f 24-May-2005 Paul Saab <ps@FreeBSD.org>

Copyout to userland if kern_sigaction succeeds


# 48052f99 31-Mar-2005 John Baldwin <jhb@FreeBSD.org>

- Use a custom version of copyinuio() to implement readv/writev using
kern_readv/writev.
- Use kern_settimeofday() and kern_adjtime() rather than stackgapping it.
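
A custom copyinuio() is needed because a 32-bit process passes iovecs whose base pointer and length are each 32 bits wide. The translation step could be sketched as below. This is an assumption-laden illustration: `struct iovec32` is an assumed layout, `widen_iovecs()` is a hypothetical helper, and `PTRIN()` is defined here only by analogy with the macro of that name mentioned elsewhere in this log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Assumed 32-bit layout: a 32-bit pointer value and a 32-bit length. */
struct iovec32 {
	uint32_t iov_base;
	uint32_t iov_len;
};

/* Widen a 32-bit user pointer value to a native pointer (assumed form). */
#define	PTRIN(v)	((void *)(uintptr_t)(v))

/*
 * Translate cnt 32-bit iovecs into native ones.  In the kernel the
 * source array would first be fetched from userland with copyin().
 */
static void
widen_iovecs(const struct iovec32 *src, struct iovec *dst, size_t cnt)
{
	size_t i;

	assert(src != NULL && dst != NULL);
	for (i = 0; i < cnt; i++) {
		dst[i].iov_base = PTRIN(src[i].iov_base);
		dst[i].iov_len = src[i].iov_len;
	}
}
```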


# b8a4edc1 01-Mar-2005 Paul Saab <ps@FreeBSD.org>

Use kern_kevent instead of the stackgap for 32bit syscall wrapping.

Submitted by: jhb
Tested on: amd64


# 5d83706b 01-Mar-2005 Paul Saab <ps@FreeBSD.org>

Ooops. I will compile test before committing. The stackgap version
of kevent32 will be going away shortly, so this is temporary until
I commit the non-stackgap version.


# 38765a31 18-Feb-2005 John Baldwin <jhb@FreeBSD.org>

- Add a custom version of exec_copyin_args() to deal with the 32-bit
pointers in argv and envv in userland and use that together with
kern_execve() and exec_free_args() to implement freebsd32_execve()
without using the stackgap.
- Fix freebsd32_adjtime() to call adjtime() rather than utimes(). Still
uses stackgap for now.
- Use kern_setitimer(), kern_getitimer(), kern_select(), kern_utimes(),
kern_statfs(), kern_fstatfs(), kern_fhstatfs(), kern_stat(),
kern_fstat(), and kern_lstat().

Tested by: cokane (amd64)
Silence on: amd64, ia64


# 7fdf2c85 19-Jan-2005 Paul Saab <ps@FreeBSD.org>

- rename nanosleep1 to kern_nanosleep
- Add a 32bit syscall entry for nanosleep

Reviewed by: peter
Obtained from: Yahoo!


# 6004362e 26-Nov-2004 David Schultz <das@FreeBSD.org>

Don't include sys/user.h merely for its side-effect of recursively
including other headers.


# a7bc3102 11-Oct-2004 Peter Wemm <peter@FreeBSD.org>

Put on my peril sensitive sunglasses and add a flags field to the internal
sysctl routines and state. Add some code to use it for signalling the need
to downconvert a data structure to 32 bits on a 64 bit OS when requested by
a 32 bit app.

I tried to do this in a generic abi wrapper that intercepted the sysctl
oid's, or looked up the format string etc, but it was a real can of worms
that turned into a fragile mess before I even got it partially working.

With this, we can now run 'sysctl -a' on a 32 bit sysctl binary and have
it not abort. Things like netstat, ps, etc have a long way to go.

This also fixes a bug in the kern.ps_strings and kern.usrstack hacks.
These do matter very much because they are used by libc_r and other things.


# 78c85e8d 05-Oct-2004 John Baldwin <jhb@FreeBSD.org>

Rework how we store process times in the kernel such that we always store
the raw values including for child process statistics and only compute the
system and user timevals on demand.

- Fix the various kern_wait() syscall wrappers to only pass in a rusage
pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
times it needs rather than calling getrusage() twice with associated
stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
for user, system, and interrupt time as well as a bintime of the total
runtime. A new p_rux field in struct proc replaces the same inline fields
from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime). A new p_crux
field in struct proc contains the "raw" child time usage statistics.
ruadd() has been changed to handle adding the associated rusage_ext
structures as well as the values in rusage. Effectively, the values in
rusage_ext replace the ru_utime and ru_stime values in struct rusage. These
two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
calculates appropriate timevals for user and system time as well as updating
the rux_[isu]u fields of a passed in rusage_ext structure. calcru() uses a
copy of the process' p_rux structure to compute the timevals after updating
the runtime appropriately if any of the threads in that process are
currently executing. It also now only locks sched_lock internally while
doing the rux_runtime fixup. calcru() now only requires the caller to
hold the proc lock and calcru1() only requires the proc lock internally.
calcru() also no longer allows callers to ask for an interrupt timeval
since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
calling calcru1() on p_crux. Note that this means that any code that wants
child times must now call this function rather than reading from p_cru
directly. This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
proctree lock is used to protect the process group rather than the process
group lock. By holding this lock until the end of the function we now
ensure that the process/thread that we pick to dump info about will no
longer vanish while we are trying to output its info to the console.

Submitted by: bde (mostly)
MFC after: 1 month


# f3732fd1 17-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Second half of the dev_t cleanup.

The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.


# c050455e 23-Apr-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Fix build for non-COMPAT_FREEBSD4 configurations. Make the FreeBSD 4
statfs functions conditional upon the option.


# 0c70bced 14-Apr-2004 Peter Wemm <peter@FreeBSD.org>

Catch up to the not-so-recent statfs(2) changes.


# b7e23e82 17-Mar-2004 John Baldwin <jhb@FreeBSD.org>

- Replace wait1() with a kern_wait() function that accepts the pid,
options, status pointer and rusage pointer as arguments. It is up to
the caller to copyout the status and rusage to userland if needed. This
lets us axe the 'compat' argument and hide all that functionality in
owait(), by the way. This also cleans up some locking in kern_wait()
since it no longer has to drop locks around copyout() since all the
copyout()'s are deferred.
- Convert owait(), wait4(), and the various ABI compat wait() syscalls to
use kern_wait() rather than wait1() or wait4(). This removes a bit
more stackgap usage.

Tested on: i386
Compiled on: i386, alpha, amd64


# 996a568e 28-Jan-2004 Peter Wemm <peter@FreeBSD.org>

Regen


# 34eda634 22-Dec-2003 Peter Wemm <peter@FreeBSD.org>

Rather than screw around with the (unsafe) stackgap, call vn_stat/fo_stat
directly for stat/fstat/lstat syscall emulation. It turns out not only
safer, but the code is smaller this way too.


# 5cb0e301 22-Dec-2003 Peter Wemm <peter@FreeBSD.org>

Eliminate stackgap usage for the (woefully incomplete) path translations
since it isn't needed here anymore.
Use standard open(2)/access(2) and chflags(2) syscalls now.


# 4eeb271a 10-Dec-2003 Peter Wemm <peter@FreeBSD.org>

Just implementing a 32 bit version of gettimeofday() was smaller than
the wrapper code. And it doesn't use the stackgap as a bonus.


# 71d60843 07-Nov-2003 Peter Wemm <peter@FreeBSD.org>

Don't write to the stackgap directly in execve().


# 60a8c422 29-Oct-2003 Peter Wemm <peter@FreeBSD.org>

Add CTASSERT()'s to check that the sizes of our replicas of the 32 bit
structures come out the right size.

Fix the ones that broke. stat32 had some missing fields from the end
and statfs32 was broken due to the strange definition of MNAMELEN
(which is dependent on sizeof(long))

I'm not sure if this fixes any actual problems or not.


# 46159d1f 22-Aug-2003 Peter Wemm <peter@FreeBSD.org>

Switch to using the emulator in the common compat area.
Still work-in-progress.


# 1c7abef7 22-Aug-2003 Peter Wemm <peter@FreeBSD.org>

Initial sweep to de-i386-ify this


# 56ae44c5 25-Jul-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().

Brought to you by: a boring talk at Ottawa Linux Symposium


# d85631c4 13-May-2003 Peter Wemm <peter@FreeBSD.org>

Add BASIC i386 binary support for the amd64 kernel. This is largely
stolen from the ia64/ia32 code (indeed there was a repocopy), but I've
redone the MD parts and added and fixed a few essential syscalls. It
is sufficient to run i386 binaries like /bin/ls, /usr/bin/id (dynamic)
and p4. The ia64 code has not implemented signal delivery, so I had
to do that.

Before you say it, yes, this does need to go in a common place. But
we're in a freeze at the moment and I didn't want to risk breaking ia64.
I will sort this out after the freeze so that the common code is in a
common place.

On the AMD64 side, this required adding segment selector context switch
support and some other support infrastructure. The %fs/%gs etc code
is hairy because loading %gs will clobber the kernel's current MSR_GSBASE
setting. The segment selectors are not used by the kernel, so they're only
changed at context switch time or when changing modes. This still needs
to be optimized.

Approved by: re (amd64/* blanket)


# fe8cdcae 22-Apr-2003 John Baldwin <jhb@FreeBSD.org>

- Replace inline implementations of sigprocmask() with calls to
kern_sigprocmask() in the various binary compatibility emulators.
- Replace calls to sigsuspend(), sigaltstack(), sigaction(), and
sigprocmask() that used the stackgap with calls to the corresponding
kern_sig*() functions instead without using the stackgap.


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# f341ca98 16-Feb-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Remove #include <sys/dkstat.h>


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# d1e405c5 13-Dec-2002 Alfred Perlstein <alfred@FreeBSD.org>

SCARGS removal take II.


# bc9e75d7 13-Dec-2002 Alfred Perlstein <alfred@FreeBSD.org>

Back out the removal of SCARGS; the code freeze is only "selectively" over.


# 0bbe7292 13-Dec-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove SCARGS.

Reviewed by: md5


# 459e3a7a 09-Oct-2002 Peter Wemm <peter@FreeBSD.org>

Try and deal with the #ifdef COMPAT_FREEBSD4 sendfile stuff. This would
have been a lot easier if do_sendfile() was usable externally.


# 3ebc1248 19-Jul-2002 Peter Wemm <peter@FreeBSD.org>

Infrastructure tweaks to allow having both an Elf32 and an Elf64 executable
handler in the kernel at the same time. Also, allow for the
exec_new_vmspace() code to build a different sized vmspace depending on
the executable environment. This is a big help for execing i386 binaries
on ia64. The ELF exec code grows the ability to map partial pages when
there is a page size difference, eg: emulating 4K pages on 8K or 16K
hardware pages.

Flesh out the i386 emulation support for ia64. At this point, the only
binary that I know of that fails is cvsup, because the cvsup runtime
tries to execute code in pages not marked executable.

Obtained from: dfr (mostly, many tweaks from me).