History log of /freebsd-current/sys/kern/sys_generic.c
Revision Date Author Comments
# 5b3e5c6c 29-Apr-2024 Konstantin Belousov <kib@FreeBSD.org>

kcmp_pget(): do not accept TIDs

Otherwise pget() might still look up and hold the current process.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 1e01650a 29-Apr-2024 Konstantin Belousov <kib@FreeBSD.org>

kcmp_pget(): add an assert that we did not hold the current process

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 47ad4f2d 04-Mar-2024 Kyle Evans <kevans@FreeBSD.org>

ktrace: log genio events on failed write

Visibility into the contents of the buffer when a write(2) has failed
can be immensely useful in debugging IPC issues -- pushing this to
discuss the idea, or maybe an alternative where we can set a flag like
KTRFAC_ERRIO to enable it.

When a genio event is potentially raised after an error, currently we'll
just free the uio and return. However, such data can be useful when
debugging communication between processes to, e.g., understand what the
remote side should have grabbed before closing a pipe. Tap out the
entire buffer on failure rather than simply discarding it.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D43799


# b5d2165b 04-Mar-2024 Kyle Evans <kevans@FreeBSD.org>

kern: poll: tap out the pollfd array on successful return

We do this in kern_poll() to include freebsd32 but exclude the linux
compat layer. The ABI should be the same, but the POLL constants are
probably different or should be assumed so.

Reviewed by: bapt, jhb
Differential Revision: https://reviews.freebsd.org/D44158


# 61cc4830 18-Jan-2024 Alfredo Mazzinghi <am2419@cl.cam.ac.uk>

Abstract UIO allocation and deallocation.

Introduce the allocuio() and freeuio() functions to allocate and
deallocate struct uio. This hides the actual allocator interface, so it
is easier to modify the sub-allocation layout of struct uio and the
corresponding iovec array.

Obtained from: CheriBSD
Reviewed by: kib, markj
MFC after: 2 weeks
Sponsored by: CHaOS, EPSRC grant EP/V000292/1
Differential Revision: https://reviews.freebsd.org/D43711


# f28526e9 19-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

kcmp(2): implement for generic file types

Reviewed by: brooks, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43518


# d8decc9a 19-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

Add kcmp(2) kernel bits

This is based purely on reading the Linux kcmp(2) man page.
In addition to the Linux set of comparators, I also added KCMP_FILEOBJ to
compare underlying file' objects.

Tested by: manu
Reviewed by: brooks, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43518


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 7a2c93b8 14-Dec-2022 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: provide sousrsend() that does socket specific error handling

Sockets have special handling for EPIPE on a write, that was spread out
into several places. Treating transient errors is also special - if
protocol is atomic, than we should ignore any changes to uio_resid, a
transient error means the write had completely failed (see d2b3a0ed31e).

- Provide sousrsend() that expects a valid uio, and leave sosend() for
kernel consumers only. Do all special error handling right here.
- In dofilewrite() don't do special handling of error for DTYPE_SOCKET.
- For send(2), write(2) and aio_write(2) call into sousrsend() and remove
error handling for kern_sendit(), soo_write() and soaio_process_job().

PR: 265087
Reported by: rz-rpi03 at h-ka.de
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35863


# c6d31b83 18-Jul-2022 Konstantin Belousov <kib@FreeBSD.org>

AST: rework

Make most AST handlers dynamically registered. This allows to have
subsystem-specific handler source located in the subsystem files,
instead of making subr_trap.c aware of it. For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.

Also, it allows to have some handlers designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit. For
instance, ast(), exit1(), and NFS server no longer need to be aware
about UFS softdep processing.

The dynamic registration also allows third-party modules to register AST
handlers if needed. There is one caveat with loadable modules: the
code does not make any effort to ensure that the module is not unloaded
before all threads processed through AST handler in it. In fact, this
is already present behavior for hwpmc.ko and ufs.ko. I do not think it
is worth the efforts and the runtime overhead to try to fix it.

Reviewed by: markj
Tested by: emaste (arm64), pho
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888


# 91e7bdcd 25-Apr-2022 Dmitry Chagin <dchagin@FreeBSD.org>

Add timespecvalid_interval macro and use it.

Reviewed by: jhb, imp (early rev)
Differential revision: https://reviews.freebsd.org/D34848
MFC after: 2 weeks


# f17ef286 22-Feb-2022 Mateusz Guzik <mjg@FreeBSD.org>

fd: rename fget*_locked to fget*_noref

This gets rid of the error prone naming where fget_unlocked returns with
a ref held, while fget_locked requires a lock but provides nothing in
terms of making sure the file lives past unlock.

No functional changes.


# 513c7a6e 10-Feb-2022 Mateusz Guzik <mjg@FreeBSD.org>

fd: make fget_unlocked take a thread argument

Just like other fget routines. This enables embedding fd table pointer
in struct thread, avoiding taking a trip through proc.


# 04c91ac4 13-Oct-2021 Brooks Davis <brooks@FreeBSD.org>

selsocket: handle sopoll() errors correctly

Without this change, unmounting smbfs filesystems with an INVARIANTS
kernel would panic after 10e64782ed59727e8c9fe4a5c7e17f497903c8eb.

Found by: markj
Reviewed by: markj, jhb
Obtained from: CheriBSD
MFC after: 3 days
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D32492


# 0dc332bf 05-Aug-2021 Ka Ho Ng <khng@FreeBSD.org>

Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).

fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI for kmods' use.

The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.

fspacectl(2) comprises of cmd and flags parameters to specify the
space management operation to be performed. Currently cmd has to be
SPACECTL_DEALLOC, and flags has to be 0.

fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28347


# cae3f9dd 23-Jul-2021 Mark Johnston <markj@FreeBSD.org>

select: Define select_flags[] as const

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# e884512a 10-Jun-2021 Dmitry Chagin <dchagin@FreeBSD.org>

Split kern_poll() on two counterparts.

The kern_poll_kfds() operates on clear kernel data, kfds points to an
array in the kernel, while kern_poll() operates on user supplied pollfd.
Move nfds check to kern_poll_maxfds().

No functional changes, it's for future use in the Linux emulation layer.

Reviewd by: kib
Differential Revision: https://reviews.freebsd.org/D30690
MFC after: 2 weeks


# 45e1f854 28-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

poll: use fget_unlocked or fget_only_user when feasible

This follows select by eleminating the use of filedesc lock.
This is a win for single-threaded processes and a mixed bag for others
as at small concurrency it is faster to take the lock instead of
refing/unrefing each file descriptor.

Nonetheless, removal of shared lock usage is a step towards a
mtx-protected fd table.


# 6affe1b7 28-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

select: employ fget_only_user

Since most select users are single-threaded this avoid a lot of work
in the common case.

For example select of 16 fds (ops/s):
before: 2114536
after: 2991010


# 7a202823 23-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

Expose eventfd in the native API/ABI using a new __specialfd syscall

eventfd is a Linux system call that produces special file descriptors
for event notification. When porting Linux software, it is currently
usually emulated by epoll-shim on top of kqueues. Unfortunately, kqueues
are not passable between processes. And, as noted by the author of
epoll-shim, even if they were, the library state would also have to be
passed somehow. This came up when debugging strange HW video decode
failures in Firefox. A native implementation would avoid these problems
and help with porting Linux software.

Since we now already have an eventfd implementation in the kernel (for
the Linuxulator), it's pretty easy to expose it natively, which is what
this patch does.

Submitted by: greg@unrelenting.technology
Reviewed by: markj (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26668


# 10e64782 01-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

select: make sure there are no wakeup attempts after selfdfree returns

Prior to the patch returning selfdfree could still be racing against doselwakeup
which set sf_si = NULL and now locks stp to wake up the other thread.

A sufficiently unlucky pair can end up going all the way down to freeing
select-related structures before the lock/wakeup/unlock finishes.

This started manifesting itself as crashes since select data started getting
freed in r367714.


# 31b2ac4b 15-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

select: replace reference counting with memory barriers in selfd

Refcounting was added to combat a race between selfdfree and doselwakup,
but it adds avoidable overhead.

selfdfree detects it can free the object by ->sf_si == NULL, thus we can
ensure that the condition only holds after all accesses are completed.


# ea33cca9 04-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

poll/select: change selfd_zone into a malloc type

On a sample box vmstat -z shows:

ITEM SIZE LIMIT USED FREE REQ
64: 64, 0, 1043784, 4367538,3698187229
selfd: 64, 0, 1520, 13726,182729008

But at the same time:
vm.uma.selfd.keg.domain.1.pages: 121
vm.uma.selfd.keg.domain.0.pages: 121

Thus 242 pages got pulled even though the malloc zone would likely accomodate
the load without using extra memory.


# 6fed89b1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

kern: clean up empty lines in .c and .h files


# b1607c87 15-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

poll: factor fd lookup out of scan and rescan


# d8bc2a17 15-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

fd: remove fd_lastfile

It keeps recalculated way more often than it is needed.

Provide a routine (fdlastfile) to get it if necessary.

Consumers may be better off with a bitmap iterator instead.


# a90fb6cf 15-Apr-2020 Hans Petter Selasky <hselasky@FreeBSD.org>

Cast all ioctl command arguments through uint32_t internally.

Hide debug print showing use of sign extended ioctl command argument
under INVARIANTS. The print is available to all and can easily fill
up the logs.

No functional change intended.

MFC after: 1 week
Sponsored by: Mellanox Technologies


# 52604ed7 03-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

fd: remove the seq argument from fget_unlocked

It is almost always NULL.


# 3ff65f71 30-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

Remove duplicated empty lines from kern/*.c

No functional changes.


# 2856d85e 08-Jan-2020 Kyle Evans <kevans@FreeBSD.org>

posix_fallocate: push vnop implementation into the fileop layer

This opens the door for other descriptor types to implement
posix_fallocate(2) as needed.

Reviewed by: kib, bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D23042


# dfa8dae4 05-Oct-2019 Mateusz Guzik <mjg@FreeBSD.org>

devfs: plug redundant bwillwrite avoidance

vn_write already checks for vnode type to see if bwillwrite should be called.

This effectively reverts r244643.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21905


# f1cf2b9d 21-Jul-2019 Konstantin Belousov <kib@FreeBSD.org>

Check and avoid overflow when incrementing fp->f_count in
fget_unlocked() and fhold().

On sufficiently large machine, f_count can be legitimately very large,
e.g. malicious code can dup same fd up to the per-process
filedescriptors limit, and then fork as much as it can.
On some smaller machine, I see
kern.maxfilesperproc: 939132
kern.maxprocperuid: 34203
which already overflows u_int. More, the malicious code can create
transient references by sending fds over unix sockets.

I realized that this check is missed after reading
https://secfault-security.com/blog/FreeBSD-SA-1902.fd.html

Reviewed by: markj (previous version), mjg
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20947


# 78c2a980 01-Nov-2018 Conrad Meyer <cem@FreeBSD.org>

kern_poll: Restore explanatory comment removed in r177374

The comment isn't stale. The check is bogus in the sense that poll(2)
does not require pollfd entries to be unique in fd space, so there is no
reason there cannot be more pollfd entries than open or even allowed
fds. The check is mostly a seatbelt against accidental misuse or
abuse. FD_SETSIZE, while usually unrelated to poll, is used as an
arbitrary floor for systems with very low kern.maxfilesperproc.

Additionally, document this possible EINVAL condition in the poll.2
manual.

No functional change.

Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D17671


# 2384981b 26-Oct-2018 Conrad Meyer <cem@FreeBSD.org>

poll: Unify userspace pollfd pointer name

Some of the poll code used 'fds' and some used 'ufds' to refer to the
uap->fds userspace pointer that was passed around to subroutines. Some of
the poll code used 'fds' to refer to the kernel memory pollfd arrays, which
seemed unnecessarily confusing.

Unify on 'ufds' to refer to the userspace pollfd array.

Additionally, 'bits' is not an accurate description of the kernel pollfd
array in kern_poll, so rename that to 'kfds'. Finally, clean up some logic
with mallocarray() and nitems().

No functional change.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D17670


# d6fda03a 21-Sep-2018 Mateusz Guzik <mjg@FreeBSD.org>

select: stop doing zero-sized memsets

Approved by: re (kib)


# d5cdcc3a 10-May-2018 Andrew Gallatin <gallatin@FreeBSD.org>

Fix the build after r333457

In r333457, the arguments to kern_pwritev() were accidentally
re-ordered as part of ANSIfication, breaking the build.


# cc3c9df8 10-May-2018 Ed Maste <emaste@FreeBSD.org>

ANSIfy sys_generic.c


# cbd92ce6 09-May-2018 Matt Macy <mmacy@FreeBSD.org>

Eliminate the overhead of gratuitous repeated reinitialization of cap_rights

- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.

Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# 63901c01 23-Feb-2018 Conrad Meyer <cem@FreeBSD.org>

kern/sys_generic.c: style(9) return(foo) -> return (foo)

No functional change.

Sponsored by: Dell EMC Isilon


# 36bce27b 01-Dec-2017 Konstantin Belousov <kib@FreeBSD.org>

Destroy seltd st_mtx and st_wait in seltdfini().

A correct destruction is important for WITNESS(4) and LOCK_PROFILING(9).

Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after: 1 week


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 32a1fb0d 10-Mar-2017 Mahdi Mokhtari <mmokhi@FreeBSD.org>

Fix NULL pointer dereference and panic with shm file pread/pwrite.

PR: 217429
Reported by: Tim Newsham <tim.newsham@nccgroup.trust>
Reviewed by: kib
Approved by: dchagin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D9844


# b38b22b0 31-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_pread() and kern_pwrite(), and use it in compats instead
of their sys_*() counterparts. The svr4 is left unchanged.

Reviewed by: kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9379


# 69a28758 15-Sep-2016 Ed Maste <emaste@FreeBSD.org>

Renumber license clauses in sys/kern to avoid skipping #3


# 5cba398b 18-Aug-2016 George V. Neville-Neil <gnn@FreeBSD.org>

Remove unusedd and obsolete openbsd_poll system call. (Phase 1)

Reported by: brooks
Reviewed by: brooks,jhb
Differential Revision: https://reviews.freebsd.org/D7548


# 51d1f690 10-Jul-2016 Robert Watson <rwatson@FreeBSD.org>

Audit file-descriptor arguments to I/O system calls such as
read(2), write(2), dup(2), and mmap(2). This auditing is not
required by the Common Criteria (and hence was not being
performed), but is valuable in both contemporary live analysis
and forensic use cases.

MFC after: 3 days
Sponsored by: DARPA, AFRL


# 611fcff9 01-Apr-2016 John Baldwin <jhb@FreeBSD.org>

Cap IOSIZE_MAX to INT_MAX for 32-bit processes.

Previously, freebsd32 binaries could submit read/write requests with lengths
greater than INT_MAX that a native kernel would have rejected.

Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5788


# 0acf5d0b 25-Feb-2016 Mark Johnston <markj@FreeBSD.org>

Improve error handling for posix_fallocate(2) and posix_fadvise(2).

- Set td_errno so that ktrace and dtrace can obtain the syscall error
number in the usual way.
- Pass negative error numbers directly to the syscall layer, as they're
not intended to be returned to userland.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5425


# fcb5b3a4 09-Jul-2015 Konstantin Belousov <kib@FreeBSD.org>

Cover a race between doselwakeup() and selfdfree(). If doselwakeup()
loop finds the selfd entry and clears its sf_si pointer, which is
handled by selfdfree() in parallel, NULL sf_si makes selfdfree() free
the memory. The result is the race and accesses to the freed memory.

Refcount the selfd ownership. One reference is for the sf_link
linkage, which is unconditionally dereferenced by selfdfree().
Another reference is for sf_threads, both selfdfree() and
doselwakeup() race to deref it, the winner unlinks and than frees the
selfd entry.

Reported by: Larry Rosenman <ler@lerctr.org>
Tested by: Larry Rosenman <ler@lerctr.org>, pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 0538aafc 18-Apr-2015 Konstantin Belousov <kib@FreeBSD.org>

The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter. The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development. The
shims were reasonable to allow easier revert to the older kernel at
that time.

Now, two or three major releases later, shims do not serve any
purpose. Such old kernels cannot handle current libc, so revert the
compatibility code.

Make padded syscalls support conditional under the COMPAT6 config
option. For COMPAT32, the syscalls were under COMPAT6 already.

Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to
(partially) disable the removed shims.

Reviewed by: jhb, imp (previous versions)
Discussed with: peter
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# b7a39e9e 17-Feb-2015 Mateusz Guzik <mjg@FreeBSD.org>

filedesc: simplify fget_unlocked & friends

Introduce fget_fcntl which performs appropriate checks when needed.
This removes a branch from fget_unlocked.

Introduce fget_mmap dealing with cap_rights_to_vmprot conversion.
This removes a branch from _fget.

Modify fget_unlocked to pass sequence counter to interested callers so
that they can perform their own checks and make sure the result was
otained from stable & current state.

Reviewed by: silence on -hackers


# 50ae6690 28-Nov-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Style changes:
- Move two IOCTL related defines to the top of the C-file
- Add more comments describing the recently added IOCTL small size and
small align macros


# 186d9c34 12-Nov-2014 Dmitry Chagin <dchagin@FreeBSD.org>

Add the ppoll() system call.
Export kern_poll() needed by an upcoming Linuxulator change.

Differential Revision: https://reviews.freebsd.org/D1133
Reviewed by: kib, wblock
MFC after: 1 month


# 0ecd606b 04-Nov-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Simplify logic a bit. Ensure data buffer is properly aligned,
especially for platforms where unaligned access is not allowed. Make
it possible to override the small buffer size.

A simple continuous read string test using libusb showed a reduction
in CPU usage from roughly 10% to less than 1% using a dual-core GHz
CPU, when the malloc() operation was skipped for small buffers.

MFC after: 2 weeks


# 5cbf44bf 03-Nov-2014 Mateusz Guzik <mjg@FreeBSD.org>

Provide an on-stack temporary buffer for small ioctl requests.


# ffc5ce7b 23-Oct-2014 Mateusz Guzik <mjg@FreeBSD.org>

In selfdfree re-evaulate sf_si after takin the lock.

Otherwise we can race with doselwakeup.

This is a fixup to r273549

Reviewed by: jhb
Reported by: everyone and their dog


# 73f2e5f7 23-Oct-2014 Mateusz Guzik <mjg@FreeBSD.org>

Avoid taking the lock in selfdfree when not needed.


# adf87ab0 21-Jun-2014 Mateusz Guzik <mjg@FreeBSD.org>

fd: replace fd_nfiles with fd_lastfile where appropriate

fd_lastfile is guaranteed to be the biggest open fd, so when the intent
is to iterate over active fds or lookup one, there is no point in looking
beyond that limit.

Few places are left unpatched for now.

MFC after: 1 week


# 4bc38a5a 12-Apr-2014 Davide Italiano <davide@FreeBSD.org>

Hide internal details of sbintime_t implementation wrapping INT64_MAX into
SBT_MAX, to make it more robust in case internal type representation will
change in the future. All the consumers were migrated to SBT_MAX and
every new consumer (if any) should from now use this interface.

Requested by: bapt, jmg, Ryan Lortie (implictly)
Reviewed by: mav, bde


# 4a144410 16-Mar-2014 Robert Watson <rwatson@FreeBSD.org>

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after: 3 weeks


# ed5848c8 15-Nov-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Replace CAP_POLL_EVENT and CAP_POST_EVENT capability rights (which I had
a very hard time to fully understand) with much more intuitive rights:

CAP_EVENT - when set on descriptor, the descriptor can be monitored
with syscalls like select(2), poll(2), kevent(2).

CAP_KQUEUE_EVENT - When set on a kqueue descriptor, the kevent(2)
syscall can be called on this kqueue to with the eventlist
argument set to non-NULL value; in other words the given
kqueue descriptor can be used to monitor other descriptors.
CAP_KQUEUE_CHANGE - When set on a kqueue descriptor, the kevent(2)
syscall can be called on this kqueue to with the changelist
argument set to non-NULL value; in other words it allows to
modify events monitored with the given kqueue descriptor.

Add alias CAP_KQUEUE, which allows for both CAP_KQUEUE_EVENT and
CAP_KQUEUE_CHANGE.

Add backward compatibility define CAP_POLL_EVENT which is equal to CAP_EVENT.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# cd4dd444 15-Oct-2013 Konstantin Belousov <kib@FreeBSD.org>

By default, allow up to SSIZE_MAX i/o for non-devfs files.

Sponsored by: The FreeBSD Foundation
Reminded by: Dmitry Sivachenko <trtrmitya@gmail.com>
MFC after: 1 month
X-MFC-note: stable/10 only


# bf3e483b 15-Oct-2013 Konstantin Belousov <kib@FreeBSD.org>

Similar to debug.iosize_max_clamp sysctl, introduce
devfs_iosize_max_clamp sysctl, which allows/disables SSIZE_MAX-sized
i/o requests on the devfs files.

Sponsored by: The FreeBSD Foundation
Reminded by: Dmitry Sivachenko <trtrmitya@gmail.com>
MFC after: 1 week


# 8e076eff 04-Sep-2013 Sean Bruno <sbruno@FreeBSD.org>

Restore builds on architectures that don't support CAPABILITIES (mips).


# 7008be5b 04-Sep-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


# 76665df9 28-Jun-2013 Peter Wemm <peter@FreeBSD.org>

Help out gcc. clang understands.

sys_generic.c:1510: warning: 'precision' may be used uninitialized
*** [sys_generic.o] Error code 1


# 237abf0c 28-Jun-2013 Davide Italiano <davide@FreeBSD.org>

- Trim an unused and bogus Makefile for mount_smbfs.
- Reconnect with some minor modifications, in particular now selsocket()
internals are adapted to use sbintime units after recent'ish calloutng
switch.


# 21a37a71 09-Mar-2013 Alexander Motin <mav@FreeBSD.org>

Rework overflow checks of r247898 to not let too "intelligent" compiler to
optimize it out.

Submitted by: bde


# 980c545d 06-Mar-2013 Alexander Motin <mav@FreeBSD.org>

Fix time math overflows and improve zero intervals handling in poll(),
select(), nanosleep() and kevent() functions after calloutng changes.

Reported by: bde


# cf5e4fe6 04-Mar-2013 Davide Italiano <davide@FreeBSD.org>

MFcalloutng:
Fix kern_select() and sys_poll() so that they can handle sub-tick
precision for timeouts (in the same fashion it was done for nanosleep()
in r247797).

Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil


# 2609222a 01-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convinient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


# ad9789f6 23-Dec-2012 Konstantin Belousov <kib@FreeBSD.org>

Do not force a writer to the devfs file to drain the buffer writes.

Requested and tested by: Ian Lepore <freebsd@damnhippie.dyndns.org>
MFC after: 2 weeks


# 2e564269 17-Oct-2012 Attilio Rao <attilio@FreeBSD.org>

Disconnect non-MPSAFE SMBFS from the build in preparation for dropping
GIANT from VFS. In addition, disconnect also netsmb, which is a base
requirement for SMBFS.

In the while SMBFS regular users can use FUSE interface and smbnetfs
port to work with their SMBFS partitions.

Also, there are ongoing efforts by vendor to support in-kernel smbfs,
so there are good chances that it will get relinked once properly locked.

This is not targeted for MFC.


# 68cefd64 17-Jun-2012 Davide Italiano <davide@FreeBSD.org>

The variable 'error' in sys_poll() is initialized in declaration to value
zero but in any case is overwritten by successive copyin(), making the
previous initialization useless. Remove this.
As an added bonus this fixes a style(9) bug.

Discussed with: kib
Approved by: gnn (mentor)
MFC after: 3 days


# 8bb9a904 04-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

Instead of incomplete handling of read(2)/write(2) return values that
does not fit into registers, declare that we do not support this case
using CTASSERT(), and remove endianess-unsafe code to split return value
into td_retval.

While there, change the style of the sysctl debug.iosize_max_clamp
definition.

Requested by: bde
MFC after: 3 weeks


# 526d0bd5 20-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with: bde, das (previous versions)
MFC after: 1 month


# 56be1b9a 13-Nov-2011 Konstantin Belousov <kib@FreeBSD.org>

To limit amount of the kernel memory allocated, and to optimize the
iteration over the fdsets, kern_select() limits the length of the
fdsets copied in by the last valid file descriptor index. If any bit
is set in a mask above the limit, current implementation ignores the
filedescriptor, instead of returning EBADF.

Fix the issue by scanning the tails of fdset before entering the
select loop and returning EBADF if any bit above last valid
filedescriptor index is set. The performance impact of the additional
check is only imposed on the (somewhat) buggy applications that pass
bad file descriptors to select(2) or pselect(2).

PR: kern/155606, kern/162379
Discussed with: cognet, glebius
Tested by: andreast (powerpc, all 64/32bit ABI combinations, big-endian),
marius (sparc64, big-endian)
MFC after: 2 weeks


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 6aba400a 25-Aug-2011 Attilio Rao <attilio@FreeBSD.org>

Fix a deficiency in the selinfo interface:
If a selinfo object is recorded (via selrecord()) and then it is
quickly destroyed, with the waiters missing the opportunity to awake,
at the next iteration they will find the selinfo object destroyed,
causing a PF#.

That happens because the selinfo interface has no way to drain the
waiters before to destroy the registered selinfo object. Also this
race is quite rare to get in practice, because it would require a
selrecord(), a poll request by another thread and a quick destruction
of the selrecord()'ed selinfo object.

Fix this by adding the seldrain() routine which should be called
before to destroy the selinfo objects (in order to avoid such case),
and fix the present cases where it might have already been called.
Sometimes, the context is safe enough to prevent this type of race,
like it happens in device drivers which installs selinfo objects on
poll callbacks. There, the destruction of the selinfo object happens
at driver detach time, when all the filedescriptors should be already
closed, thus there cannot be a race.
For this case, mfi(4) device driver can be set as an example, as it
implements a full correct logic for preventing this from happening.

Sponsored by: Sandvine Incorporated
Reported by: rstone
Tested by: pluknet
Reviewed by: jhb, kib
Approved by: re (bz)
MFC after: 3 weeks


# d6f72489 16-Aug-2011 Jonathan Anderson <jonathan@FreeBSD.org>

poll(2) implementation for capabilities.

When calling poll(2) on a capability, unwrap first and then poll the
underlying object.

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc


# d1b6899e 12-Aug-2011 Jonathan Anderson <jonathan@FreeBSD.org>

Rename CAP_*_KEVENT to CAP_*_EVENT.

Change the names of a couple of capability rights to be less
FreeBSD-specific.

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc


# a9d2f8d8 10-Aug-2011 Robert Watson <rwatson@FreeBSD.org>

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 6d8fedda 28-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

For some file types, select code registers two selfd structures. E.g.,
for socket, when specified POLLIN|POLLOUT in events, you would have one
selfd registered for receiving socket buffer, and one for sending. Now,
if both events are not ready to fire at the time of the initial scan,
but are simultaneously ready after the sleep, pollrescan() would iterate
over the pollfd struct twice. Since both times revents is not zero,
returned value would be off by one.

Fix this by recalculating the return value in pollout().

PR: kern/143029
MFC after: 2 weeks


# 7a6f3d78 29-Jun-2010 John Baldwin <jhb@FreeBSD.org>

Send SIGPIPE to the thread that issued the offending system call
rather than to the entire process.

Reported by: Anit Chakraborty
Reviewed by: kib, deischen (concept)
MFC after: 1 week


# f0f2f0a9 04-Jun-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r208374:
Remove POLLHUP from the flags used to test for to set exceptfsd
fd_set bits in select(2). It seems that historical behaviour is to not
reporting exception on EOF, and several applications are broken.

Approved by: re (kensmith)


# 61e53a38 21-May-2010 Konstantin Belousov <kib@FreeBSD.org>

Remove PIOLLHUP from the flags used to test for to set exceptfsd
fd_set bits in select(2). It seems that historical behaviour is to not
reporting exception on EOF, and several applications are broken.

Reported by: Yoshihiko Sarumaru <ysarumaru gmail com>
Discussed with: bde
PR: ports/140934
MFC after: 2 weeks


# 4ccf64eb 06-Apr-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

MFC r205014,205015:

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

This MFC is required for MFCs of later changes to the freebsd32
compatibility from HEAD.

Requested by: kib


# 841c0c7e 11-Mar-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

Reviewed by: kib, jhb


# 7e767511 19-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r198508, r198509:
Reimplement pselect() in kernel, making change of sigmask and sleep atomic.

MFC r198538:
Move pselect(3) man page to section 2.


# 066d836b 27-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

Current pselect(3) is implemented in usermode and thus vulnerable to
well-known race condition, which elimination was the reason for the
function appearance in first place. If sigmask supplied as argument to
pselect() enables a signal, the signal might be delivered before thread
called select(2), causing lost wakeup. Reimplement pselect() in kernel,
making change of sigmask and sleep atomic.

Since signal shall be delivered to the usermode, but sigmask restored,
set TDP_OLDMASK and save old mask in td_oldsigmask. The TDP_OLDMASK
should be cleared by ast() in case signal was not gelivered during
syscall execution.

Reviewed by: davidxu
Tested by: pho
MFC after: 1 month


# 9f1fab50 16-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r197049:
Calculate the amount of bytes to copy for select filedescriptor masks
taking into account size of fd_set for the current process ABI.

Approved by: re (kensmith)


# b55ef216 09-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

kern_select(9) copies fd_set in and out of userspace in quantities of
longs. Since 32bit processes longs are 4 bytes, 64bit kernel may copy in
or out 4 bytes more then the process expected.

Calculate the amount of bytes to copy taking into account size of fd_set
for the current process ABI.

Diagnosed and tested by: Peter Jeremy <peterjeremy acm org>
Reviewed by: jhb
MFC after: 1 week


# 33656b91 01-Sep-2009 Jilles Tjoelker <jilles@FreeBSD.org>

MFC r196460

Fix the conformance of poll(2) for sockets after r195423 by
returning POLLHUP instead of POLLIN for several cases. Now, the
tools/regression/poll results for FreeBSD are closer to that of the
Solaris and Linux.

Also, improve the POSIX conformance by explicitely clearing POLLOUT
when POLLHUP is reported in pollscan(), making the fix global.

Submitted by: bde
Reviewed by: rwatson

MFC r196556

Fix poll() on half-closed sockets, while retaining POLLHUP for fifos.

This reverts part of r196460, so that sockets only return POLLHUP if both
directions are closed/error. Fifos get POLLHUP by closing the unused
direction immediately after creating the sockets.

The tools/regression/poll/*poll.c tests now pass except for two other
things:
- if POLLHUP is returned, POLLIN is always returned as well instead of
only when there is data left in the buffer to be read
- fifo old/new reader distinction does not work the way POSIX specs it

Reviewed by: kib, bde

MFC r196554

Add some tests for poll(2)/shutdown(2) interaction.

Approved by: re (kensmith)


# f2159cc7 22-Aug-2009 Konstantin Belousov <kib@FreeBSD.org>

Fix the conformance of poll(2) for sockets after r195423 by
returning POLLHUP instead of POLLIN for several cases. Now, the
tools/regression/poll results for FreeBSD are closer to that of the
Solaris and Linux.

Also, improve the POSIX conformance by explicitely clearing POLLOUT
when POLLHUP is reported in pollscan(), making the fix global.

Submitted by: bde
Reviewed by: rwatson
MFC after: 1 week


# 64c9a4d9 02-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Audit file descriptor and command arguments to ioctl(2).

Approved by: re (audit argument blanket)
MFC after: 1 week


# 2141453e 01-Jul-2009 Jeff Roberson <jeff@FreeBSD.org>

- Use fd_lastfile + 1 as the upper bound on nd. This is more correct than
using the size of the descriptor array.
- A lock is not needed to fetch fd_lastfile. The results are stale the
instant it is dropped.
- Use a private mutex pool for select since the pool mutex is not used
as a leaf.
- Fetch the si_mtx pointer first before resorting to hashing to compute
the mutex address.

Reviewed by: McKusick
Approved by: re (kib)


# 14961ba7 27-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Replace AUDIT_ARG() with variable argument macros with a set more more
specific macros for each audit argument type. This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).

In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.

Suggested by: brooks
Approved by: re (kib)
Obtained from: TrustedBSD Project
MFC after: 1 week


# bf422e5f 13-May-2009 Jeff Roberson <jeff@FreeBSD.org>

- Implement a lockless file descriptor lookup algorithm in
fget_unlocked().
- Save old file descriptor tables created on expansion until
the entire descriptor table is freed so that pointers may be
followed without regard for expanders.
- Mark the file zone as NOFREE so we may attempt to reference
potentially freed files.
- Convert several fget_locked() users to fget_unlocked(). This
requires us to manage reference counts explicitly but reduces
locking overhead in the common case.


# ae81968f 11-Mar-2009 Robert Watson <rwatson@FreeBSD.org>

When writing out updated pollfd records when returning from
poll(), only copy out the revents field, not the whole pollfd
structure. Otherwise, if the events field is updated
concurrently by another thread, that update may be lost.

This issue apparently causes problems for the JDK on FreeBSD,
which expects the Linux behavior of not updating all fields
(somewhat oddly, Solaris does not implement the required
behavior, but presumably our adaptation of the JDK is based
on the Linux port?).

MFC after: 2 weeks
PR: kern/130924
Submitted by: Kurt Miller <kurt @ intricatesoftware.com>
Discussed with: kib


# 125dcf8c 06-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

Extract the no_poll() and vop_nopoll() code into the common routine
poll_no_poll().
Return a poll_no_poll() result from devfs_poll_f() when
filedescriptor does not reference the live cdev, instead of ENXIO.

Noted and tested by: hps
MFC after: 1 week


# 60b7f468 01-Feb-2009 Stephane E. Potvin <sepotvin@FreeBSD.org>

Fix select on platforms where sizeof(long) != sizeof(int). This used
to work by accident before the cleanup done in revision 187693.

Approved by: kan (mentor)


# 9cdacff1 25-Jan-2009 Jeff Roberson <jeff@FreeBSD.org>

- bit has to be fd_mask to work properly on 64bit platforms. Constants
must also be cast even though the result ultimately is promoted
to 64bit.
- Correct a loop index upper bound in selscan().


# 748b9df6 25-Jan-2009 Jeff Roberson <jeff@FreeBSD.org>

- Correct a typo in a comment.

Noticed by: danger


# 11b763df 25-Jan-2009 Jeff Roberson <jeff@FreeBSD.org>

Fix errors introduced when I rewrote select.
- Restructure selscan() and selrescan() to avoid producing extra selfps
when we have a fd in multiple sets. As described below multiple selfps
may still exist for other reasons.
- Make selrescan() tolerate multiple selfds for a given descriptor
set since sockets use two selinfos per fd. If an event on each selinfo
fires selrescan() will see the descriptor twice. This could result in
select() returning 2x the number of fds actually existing in fd sets.

Reported by: mgleason@ncftp.com


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 715457f6 23-Sep-2008 David E. O'Brien <obrien@FreeBSD.org>

Reverse if() logic to improve readability.

Reviewed by: ru


# bd4e1535 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Remove stale comment.
- In the last revision the code was changed to use maxfilesperproc rather than
the per-process file limit to restrict the size of the poll array. This
eliminates a significant source of process lock contention in multithreaded
programs and is cheaper. This had been committed with the wrong batch of
changes.


# 374ae2a3 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.


# e4650294 07-Jan-2008 John Baldwin <jhb@FreeBSD.org>

Make ftruncate a 'struct file' operation rather than a vnode operation.
This makes it possible to support ftruncate() on non-vnode file types in
the future.
- 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on
a given file descriptor.
- ftruncate() moves to kern/sys_generic.c and now just fetches a file
object and invokes fo_truncate().
- The vnode-specific portions of ftruncate() move to vn_truncate() in
vfs_vnops.c which implements fo_truncate() for vnode file types.
- Non-vnode file types return EINVAL in their fo_truncate() method.

Submitted by: rwatson


# 397c19d1 29-Dec-2007 Jeff Roberson <jeff@FreeBSD.org>

Remove explicit locking of struct file.
- Introduce a finit() which is used to initailize the fields of struct file
in such a way that the ops vector is only valid after the data, type,
and flags are valid.
- Protect f_flag and f_count with atomic operations.
- Remove the global list of all files and associated accounting.
- Rewrite the unp garbage collection such that it no longer requires
the global list of all files and instead uses a list of all unp sockets.
- Mark sockets in the accept queue so we don't incorrectly gc them.

Tested by: kris, pho


# ace8398d 15-Dec-2007 Jeff Roberson <jeff@FreeBSD.org>

Refactor select to reduce contention and hide internal implementation
details from consumers.

- Track individual selecters on a per-descriptor basis such that there
are no longer collisions and after sleeping for events only those
descriptors which triggered events must be rescaned.
- Protect the selinfo (per descriptor) structure with a mtx pool mutex.
mtx pool mutexes were chosen to preserve api compatibility with
existing code which does nothing but bzero() to setup selinfo
structures.
- Use a per-thread wait channel rather than a global wait channel.
- Hide select implementation details in a seltd structure which is
opaque to the rest of the kernel.
- Provide a 'selsocket' interface for those kernel consumers who wish to
select on a socket when they have no fd so they no longer have to
be aware of select implementation details.

Tested by: kris
Reviewed on: arch


# 431f8906 13-Nov-2007 Julian Elischer <julian@FreeBSD.org>

generally we are interested in what thread did something as
opposed to what process. Since threads by default have teh name of the
process unless over-written with more useful information, just print the
thread name instead.


# c2815ad5 04-Jul-2007 Peter Wemm <peter@FreeBSD.org>

Add freebsd6_ wrappers for mmap/lseek/pread/pwrite/truncate/ftruncate

Approved by: re (kensmith)


# 982d11f8 04-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# fa75abb0 01-May-2007 Alan Cox <alc@FreeBSD.org>

Remove unneeded include files.


# 5e3f7694 04-Apr-2007 Robert Watson <rwatson@FreeBSD.org>

Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
acquisisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by: kris
Discussed with: jhb, kris, attilio, jeff


# 873fbcd7 05-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Further system call comment cleanup:

- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.


# 0c14ff0e 04-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Remove 'MPSAFE' annotations from the comments above most system calls: all
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.

Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.


# 6f7ca813 01-Mar-2007 Bruce M Simpson <bms@FreeBSD.org>

Do not dispatch SIGPIPE from the generic write path for a socket; with
this patch the code behaves according to the comment on the line above.

Without this patch, a socket could cause SIGPIPE to be delivered to its
process, once with SO_NOSIGPIPE set, and twice without.

With this patch, the kernel now passes the sigpipe regression test.

Tested by: Anton Yuzhaninov
MFC after: 1 week


# a1b0a180 14-Oct-2006 Ruslan Ermilov <ru@FreeBSD.org>

Prevent IOC_IN with zero size argument (this is only supported
if backward copatibility options are present) from attempting
to free memory that wasn't allocated. This is an old bug, and
previously it would attempt to free a null pointer. I noticed
this bug when working on the previous revision, but forgot to
fix it.

Security: local DoS
Reported by: Peter Holm
MFC after: 3 days


# 9fddcc66 27-Sep-2006 Ruslan Ermilov <ru@FreeBSD.org>

Fix our ioctl(2) implementation when the argument is "int". New
ioctls passing integer arguments should use the _IOWINT() macro.
This fixes a lot of ioctl's not working on sparc64, most notable
being keyboard/syscons ioctls.

Full ABI compatibility is provided, with the bonus of fixing the
handling of old ioctls on sparc64.

Reviewed by: bde (with contributions)
Tested by: emax, marius
MFC after: 1 week


# d9f46233 08-Jul-2006 John Baldwin <jhb@FreeBSD.org>

- Split ioctl() up into ioctl() and kern_ioctl(). The kern_ioctl() assumes
that the 'data' pointer is already setup to point to a valid KVM buffer
or contains the copied-in data from userland as appropriate (ioctl(2)
still does this). kern_ioctl() takes care of looking up a file pointer,
implementing FIONCLEX and FIOCLEX, and calling fi_ioctl().
- Use kern_ioctl() to implement xenix_rdchk() instead of using the stackgap
and mark xenix_rdchk() MPSAFE.


# af56abaa 06-Jan-2006 John Baldwin <jhb@FreeBSD.org>

Return error from fget_write() rather than hardcoding EBADF now that
fget_write() DTRT.

Requested by: bde


# e730167f 05-Jan-2006 John Baldwin <jhb@FreeBSD.org>

Remove XXX comments complaining that write(2) on a read-only descriptor
returns EBADF. That errno is correct and is mandated by POSIX. It also
goes back to revision 1.1 of our CVS history (i.e. 4.4BSD).

The _fget() function should probably also be upated as it currently returns
EINVAL in that case rather than EBADF. (It does return EBADF for reads
on a write-only descriptor without any XXX comments oddly enough.)

Discussed with: scottl, grog, mjacob, bde


# bcd9e0dd 07-Jul-2005 John Baldwin <jhb@FreeBSD.org>

- Add two new system calls: preadv() and pwritev() which are like readv()
and writev() except that they take an additional offset argument and do
not change the current file position. In SAT speak:
preadv:readv::pread:read and pwritev:writev::pwrite:write.
- Try to reduce code duplication some by merging most of the old
kern_foov() and dofilefoo() functions into new dofilefoo() functions
that are called by kern_foov() and kern_pfoov(). The non-v functions
now all generate a simple uio on the stack from the passed in arguments
and then call kern_foov(). For example, read() now just builds a uio and
calls kern_readv() and pwrite() just builds a uio and calls kern_pwritev().

PR: kern/80362
Submitted by: Marc Olzheim marcolz at stack dot nl (1)
Approved by: re (scottl)
MFC after: 1 week


# 2de92a38 29-Jun-2005 Peter Wemm <peter@FreeBSD.org>

Conditionally weaken sys_generic.c rev 1.136 to allow certain dubious
ioctl numbers in backwards compatability mode. eg: an IOC_IN ioctl with
a size of zero. Traditionally this was what you did before IOC_VOID
existed, and we had some established users of this in the tree, namely
procfs. Certain 3rd party drivers with binary userland components also
have this too.

This is necessary to have 4.x and 5.x binaries use these ioctl's. We
found this at work when trying to run 4.x binaries.

Approved by: re


# b88ec951 31-Mar-2005 John Baldwin <jhb@FreeBSD.org>

Implement kern_adjtime(), kern_readv(), kern_sched_rr_get_interval(),
kern_settimeofday(), and kern_writev() to allow for further stackgap
reduction in the compat ABIs.


# e5e6a464 10-Feb-2005 Colin Percival <cperciva@FreeBSD.org>

Declare "cnt" (a number of bytes to read or write) as an "ssize_t", not
as a "long" in dofileread() and dofilewrite().

Discussed with: jhb


# 4f8d23d6 25-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Previously a read of zero bytes got handled in devfs:vop_read() but I
missed that when the vnode bypass was introduced.

Deal with zero length transfers before we even get to fo_ops->fo_read().

Found by: Slawa Olhovchenkov <slwzxy.spb.ru@zxy.spb.ru>
PR: 75758


# 9fc6aa06 18-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Detect sign-extension bugs in the ioctl(2) command argument: Truncate
to 32 bits and print warning.


# 9454b2d8 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for copyright notices, minor format tweaks as necessary


# a0fbccc9 17-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Push Giant down through ioctl.

Don't grab Giant in the upper syscall/wrapper code

NET_LOCK_GIANT in the socket code (sockets/fifos).

mtx_lock(&Giant) in the vnode code.

mtx_lock(&Giant) in the opencrypto code. (This may actually not be
needed, but better safe than sorry).

Devfs grabs Giant if the driver is marked as needing Giant.


# db446e30 17-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Push Giant down through select and poll.

Don't grab Giant in the upper syscall/wrapper code

NET_LOCK_GIANT in the socket code (sockets/fifos).

mtx_lock(&Giant) in the vnode code.

Devfs grabs Giant if the driver is marked as needing Giant.


# 8ccf264f 16-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Polish code to correctly reflect structure.


# ca51b19b 14-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Rearrange memory management for ioctl arguments to use stronger checks
for illegal values and don't store them on the stack any more.


# 3e15c66f 13-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

style polish.


# 124e4c3b 13-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce an alias for FILEDESC_{UN}LOCK() with the suffix _FAST.

Use this in all the places where sleeping with the lock held is not
an issue.

The distinction will become significant once we finalize the exact
lock-type to use for this kind of case.


# 2580f4e5 27-Aug-2004 Andre Oppermann <andre@FreeBSD.org>

Poll() uses the array smallbits that is big enough to hold 32 struct
pollfd's to avoid calling malloc() on small numbers of fd's. Because
smalltype's members have type char, its address might be misaligned
for a struct pollfd. Change the array of char to an array of struct
pollfd.

PR: kern/58214
Submitted by: Stefan Farfeleder <stefan@fafoe.narf.at>
Reviewed by: bde (a long time ago)
MFC after: 3 days


# 552afd9c 10-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Clean up and wash struct iovec and struct uio handling.

Add copyiniov() which copies a struct iovec array in from userland into
a malloc'ed struct iovec. Caller frees.

Change uiofromiov() to malloc the uio (caller frees) and name it
copyinuio() which is more appropriate.

Add cloneuio() which returns a malloc'ed copy. Caller frees.

Use them throughout.


# 7f8a436f 05-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 5d8dd01d 12-Mar-2004 Robert Watson <rwatson@FreeBSD.org>

Add annotations to mtx_lock(&Giant) in kern_select() and poll() that
we always grab Giant, even if we're actually only polling objects that
don't require giant. Once socket locking is merged, there will be
strong motivation to fix this.


# 44f3b092 27-Feb-2004 John Baldwin <jhb@FreeBSD.org>

Switch the sleep/wakeup and condition variable implementations to use the
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep qeueus in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleep's no longer inherently bump the priority. Instead, this is soley
a propery of msleep() which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.


# 91d5354a 04-Feb-2004 John Baldwin <jhb@FreeBSD.org>

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


# 9bbee259 19-Jan-2004 Andrey A. Chernov <ache@FreeBSD.org>

pread/pwrite:
follow lseek spirit - return EINVAL on negative offset for non-VCHR


# 512824f8 09-Nov-2003 Seigo Tanimura <tanimura@FreeBSD.org>

- Implement selwakeuppri() which allows raising the priority of a
thread being waken up. The thread waken up can run at a priority as
high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
threads. Used by selwakeuppri() if collision occurs.

Not objected in: -arch, -current


# b2941431 26-Sep-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce no_poll() default method for device drivers. Have it
do exactly the same as vop_nopoll() for consistency and put a
comment in the two pointing at each other.

Retire seltrue() in favour of no_poll().

Create private default functions in kern_conf.c instead of public
ones.

Change default strategy to return the bio with ENODEV instead of
doing nothing which would lead the bio stranded.

Retire public nullopen() and nullclose() as well as the entire band
of public no{read,write,ioctl,mmap,kqfilter,strategy,poll,dump}
funtions, they are the default actions now.

Move the final two trivial functions from subr_xxx.c to kern_conf.c
and retire the now empty subr_xxx.c


# 882d8469 31-Jul-2003 Alan Cox <alc@FreeBSD.org>

Remove Giant from writev(2). Eliminate trivial style differences between
writev(2) and readv(2).


# 2db4b023 18-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce a new flag on a file descriptor: DFLAG_SEEKABLE and use that
rather than assume that only DTYPE_VNODE is seekable.


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 104a9b7e 29-Apr-2003 Alexander Kabaev <kan@FreeBSD.org>

Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on: standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# d1e405c5 13-Dec-2002 Alfred Perlstein <alfred@FreeBSD.org>

SCARGS removal take II.


# bc9e75d7 13-Dec-2002 Alfred Perlstein <alfred@FreeBSD.org>

Backout removal SCARGS, the code freeze is only "selectively" over.


# 0bbe7292 13-Dec-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove SCARGS.

Reviewed by: md5


# 37c84183 28-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# 2e45c1b1 20-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

We don't need the <sys/disklabel.h> include for alpha anymore.

Sponsored by: DARPA & NAI Labs.


# 71fad9fd 11-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Completely redo thread states.

Reviewed by: davidxu@freebsd.org


# 8f19eb88 01-Sep-2002 Ian Dowse <iedowse@FreeBSD.org>

Split out a number of mostly VFS and signal related syscalls into
a kernel-internal kern_*() version and a wrapper that is called via
the syscall vector table. For paths and structure pointers, the
internal version either takes a uio_seg parameter or requires the
caller to copyin() the data to kernel memory as appropiate. This
will permit emulation layers to use these syscalls without having
to copy out translated arguments to the stack gap.

Discussed on: -arch
Review/suggestions: bde, jhb, peter, marcel


# 2149c527 23-Aug-2002 Peter Wemm <peter@FreeBSD.org>

Move the TAILQ_INIT(&td->td_selq) before the retry: label. Otherwise in
some circumstances when we get a select collision, we can end up with
cases where we do not clear some sip->si_thread on the way out, leading to
page faults in selwakeup(). This should solve the problem where postfix
can crash the kernel during select collisions.

Reviewed by: alfred


# d49fa1ca 16-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

In continuation of early fileop credential changes, modify fo_ioctl() to
accept an 'active_cred' argument reflecting the credential of the thread
initiating the ioctl operation.

- Change fo_ioctl() to accept active_cred; change consumers of the
fo_ioctl() interface to generally pass active_cred from td->td_ucred.
- In fifofs, initialize filetmp.f_cred to ap->a_cred so that the
invocations of soo_ioctl() are provided access to the calling f_cred.
Pass ap->a_td->td_ucred as the active_cred, but note that this is
required because we don't yet distinguish file_cred and active_cred
in invoking VOP's.
- Update kqueue_ioctl() for its new argument.
- Update pipe_ioctl() for its new argument, pass active_cred rather
than td_ucred to MAC for authorization.
- Update soo_ioctl() for its new argument.
- Update vn_ioctl() for its new argument, use active_cred rather than
td->td_ucred to authorize VOP_IOCTL() and the associated VOP_GETATTR().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# ea6027a8 15-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Make similar changes to fo_stat() and fo_poll() as made earlier to
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential. Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential. Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument. This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.

Trickle this change down into fo_stat/poll() implementations:

- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
than cred to pru_sopoll() to maintain current semantics.
- sopoll(): moidfy arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL()
to maintian current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
and consumers so that this distinction is maintained at the VFS
as well as 'struct file' layer. Pass active_cred instead of
td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.

- fifofs: modify the creation of a "filetemp" so that the file
credential is properly initialized and can be used in the socket
code if desired. Pass ap->a_td->td_ucred as the active
credential to soo_poll(). If we teach the vnop interface about
the distinction between file and active credentials, we would use
the active credential here.

Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained. It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 9ca43589 15-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# b605b54c 23-Jul-2002 Alfred Perlstein <alfred@FreeBSD.org>

Attempt to clarify comment in selrecord.


# d452ec95 22-Jul-2002 Alfred Perlstein <alfred@FreeBSD.org>

remove caddr_t from fo_ioctl calls


# 0a3e28cf 22-Jul-2002 Alfred Perlstein <alfred@FreeBSD.org>

remove caddr_t


# e602ba25 29-Jun-2002 Julian Elischer <julian@FreeBSD.org>

Part 1 of KSE-III

The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# c33c8251 20-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

Implement SO_NOSIGPIPE option for sockets. This allows one to request that
an EPIPE error return not generate SIGPIPE on sockets.

Submitted by: lioux
Inspired by: Darwin


# c4bacc18 19-Jun-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the compat bits for the mis-aligned struct disklabel on alpha,
people got three times longer than I promised.

Sponsored by: DARPA & NAI Labs.


# 9ae6d334 11-Jun-2002 Kelly Yancey <kbyanc@FreeBSD.org>

Make nselcol, the number of select collisions since boot, unsigned as
negative collisions simply doesn't make sense.

PR: (one small part of) 19720
Approved by: alfred


# 60a9bb19 06-Jun-2002 John Baldwin <jhb@FreeBSD.org>

Catch up to changes in ktrace API.


# 82641acd 08-May-2002 Alan Cox <alc@FreeBSD.org>

o Correct an error made in revision 1.65: In readv(), if uap->iovcnt is
out-of-range, drop the file reference before returning. (This error
also exists in the RELENG_4 branch.)
o Eliminate the acquisition and release of Giant in readv()
now that malloc() and free() are callable without Giant.


# 0b5d880d 02-May-2002 Poul-Henning Kamp <phk@FreeBSD.org>

As promised make the hack for sizeof(struct disklabel) on alpha annoying.

Run make world (or recompile whatever program whines) to get rid of warning.

Compat bits will be removed entirely in about two weeks.


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# f67ad03a 04-Apr-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Delete the bogus d_boot[01] fields from struct disklabel.

This shrinks the size 4 bytes on alpha, down to the same 276 bytes
as all other platforms.

Construct a hack to make old ioctls work on new kernels.

Once world is recompiled only the new and correct sysctls will be
used.

This hack will become annoying around 1st of may to make people
rebuild their worlds and it will be gone before 5.0.


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# 628abf6c 15-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Giant pushdown for read/write/pread/pwrite syscalls.

kern/kern_descrip.c:
Aquire Giant in fdrop_locked when file refcount hits zero, this removes
the requirement for the caller to own Giant for the most part.

kern/kern_ktrace.c:
Aquire Giant in ktrgenio, simplifies locking in upper read/write syscalls.

kern/vfs_bio.c:
Aquire Giant in bwillwrite if needed.

kern/sys_generic.c
Giant pushdown, remove Giant for:
read, pread, write and pwrite.
readv and writev aren't done yet because of the possible malloc calls
for iov to uio processing.

kern/sys_socket.c
Grab giant in the socket fo_read/write functions.

kern/vfs_vnops.c
Grab giant in the vnode fo_read/write functions.


# 85f190e4 13-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Fixes to make select/poll mpsafe.

Problem:
selwakeup required calling pfind which would cause lock order
reversals with the allproc_lock and the per-process filedesc lock.
Solution:
Instead of recording the pid of the select()'ing process into the
selinfo structure, actually record a pointer to the thread. To
avoid dereferencing a bad address all the selinfo structures that
are in use by a thread are kept in a list hung off the thread
(protected by sellock). When a selwakeup occurs the selinfo is
removed from that threads list, it is also removed on the way out
of select or poll where the thread will traverse its list removing
all the selinfos from its own list.

Problem:
Previously the PROC_LOCK was used to provide the mutual exclusion
needed to ensure proper locking, this couldn't work because there
was a single condvar used for select and poll and condvars can
only be used with a single mutex.
Solution:
Introduce a global mutex 'sellock' which is used to provide mutual
exclusion when recording events to wait on as well as performing
notification when an event occurs.

Interesting note:
schedlock is required to manipulate the per-thread TDF_SELECT
flag, however if given its own field it would not need schedlock,
also because TDF_SELECT is only manipulated under sellock one
doesn't actually use schedlock for syncronization, only to protect
against corruption.

Proc locks are no longer used in select/poll.

Portions contributed by: davidc


# bbbb04ce 09-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P


# 4658f926 30-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove unused variables in select(2) from previous delta.

Pointed out by: bde


# eb209311 29-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

Attempt to fixup select(2) and poll(2), this should fix some races with
other threads as well as speed up the interfaces.

To fix the race and accomplish the speedup, remove selholddrop and
pollholddrop. The entire concept is somewhat bogus because holding
the individual struct file pointers offers us no guarantees that
another thread context won't close it on us thereby removing our
access to our own reference.

Selholddrop and pollholddrop also would do multiple locks and unlocks
of mutexes _per-file_ in the fd arrays to be scanned, this needed to
be sped up.

Instead of using selholddrop and pollholddrop, simply hold the
filedesc lock over the selscan and pollscan functions. This should
protect us against close(2)'s on the files as reduce the multiple
lock/unlock pairs per fd into a single lock over the filedesc.


# 97fa4397 23-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

make pread use fget_read instead of holdfp.


# aa11a498 18-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

undo a bit of the Giant pushdown.

fdrop isn't SMP safe as it may call into the file's close routine which
definetly is not SMP safe right now, so we hold Giant over calls to
fdrop now.


# b5c93a56 16-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

Fix giant handling in pwrite(2), I forgot to release it when finishing
the syscall.


# a4db4953 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

Replace ffind_* with fget calls.

Make fget MPsafe.

Make fgetvp and fgetsock use the fget subsystem to reduce code bloat.

Push giant down in fpathconf().


# 426da3bc 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.


# b064d43d 13-Nov-2001 Matthew Dillon <dillon@FreeBSD.org>

remove holdfp()

Replace uses of holdfp() with fget*() or fgetvp*() calls as appropriate

introduce fget(), fget_read(), fget_write() - these functions will take
a thread and file descriptor and return a file pointer with its ref
count bumped.

introduce fgetvp(), fgetvp_read(), fgetvp_write() - these functions will
take a thread and file descriptor and return a vref()'d vnode.

*_read() requires that the file pointer be FREAD, *_write that it be
FWRITE.

This continues the cleanup of struct filedesc and struct file access
routines which, when are all through with it, will allow us to then
make the API calls MP safe and be able to move Giant down into the fo_*
functions.


# fea2ab83 21-Sep-2001 John Baldwin <jhb@FreeBSD.org>

The P_SELECT flag was moved from p->p_flag to td->td_flags, but p_flag
was locked by the proc lock and td_flags is locked by the sched_lock.
The places that read, set, and cleared TDF_SELECT weren't updated, so they
read and modified td_flags w/o holding the sched_lock, meaning that they
could corrupt the per-thread flags field. As an immediate band-aid,
grab sched_lock while reading and manipulating td_flags in relation to
TDF_SELECT. This will probably be cleaned up some later on.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# ad2edad9 01-Sep-2001 Matthew Dillon <dillon@FreeBSD.org>

Giant Pushdown:
read() pread() readv() write () pwrite() writev() ioctl() select ()
poll() openbsd_poll()


# 1b369704 15-May-2001 Seigo Tanimura <tanimura@FreeBSD.org>

Back out scanning file descriptors with holding a process lock.
selrecord() requires allproc sx in pfind(), resulting in lock order
reversal between allproc and a process lock.


# 265fc98f 13-May-2001 Seigo Tanimura <tanimura@FreeBSD.org>

- Convert msleep(9) in select(2) and poll(2) to cv_*wait*(9).

- Since polling should not involve sleeping, keep holding a
process lock upon scanning file descriptors.

- Hold a reference to every file descriptor prior to entering
polling loop in order to avoid lock order reversal between
lockmgr and p_mtx upon calling fdrop() in fo_poll().
(NOTE: this work has not been done for netncp and netsmb
yet because a socket itself has no reference counts.)

Reviewed by: jhb


# 33a9ed9d 23-Apr-2001 John Baldwin <jhb@FreeBSD.org>

Change the pfind() and zpfind() functions to lock the process that they
find before releasing the allproc lock and returning.

Reviewed by: -smp, dfr, jake


# 19eb87d2 06-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Grab the process lock while calling psignal and before calling psignal.


# ea0237ed 27-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Correctly declare variables as u_int rather than doing typecasts.
Kill some register declarations while I'm here.

Submitted by: bde (1)


# 0b7088c4 26-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Cast nfds to u_int before range checking it in order to catch negative
values.

PR: 25393


# 2bd5ac33 09-Feb-2001 Peter Wemm <peter@FreeBSD.org>

poll(2) array limits (take 2) - after some input from bde.


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 89b71647 07-Feb-2001 Peter Wemm <peter@FreeBSD.org>

The code I picked up from NetBSD in '97 had a nasty bug. It limited
the index of the pollfd array to the number of fd's currently open, not
the maximum number of fd's. ie: if you had 0,1,2 open, you could not
use pollfd slots higher than 20. The specs say we only have to support
OPEN_MAX [64] entries but we allow way more than that.


# e04ac2fe 24-Jan-2001 John Baldwin <jhb@FreeBSD.org>

- Catch up to proc flag changes.
- Add proc locking for selwakeup() and selrecord().


# 0a2c3d48 08-Jan-2001 Garrett Wollman <wollman@FreeBSD.org>

select() DKI is now in <sys/selinfo.h>.


# a41ce5d3 07-Dec-2000 Matthew Dillon <dillon@FreeBSD.org>

Only call bwillwrite() for vnodes. Do not penalize devices or pipes.


# 9440653d 06-Dec-2000 Matthew Dillon <dillon@FreeBSD.org>

Add necessary bwillwrite() in writev() entry point.

Deal with excessive dirty buffers when msync() syncs non-contiguous
dirty buffers by checking for the case in UFS *before* checking for
clusterability.


# c6ab5768 30-Nov-2000 Alfred Perlstein <alfred@FreeBSD.org>

only call bwillwrite() to stall on IO when dealing with VNODEs otherwise
we will stall on non-disk IO for things like fifos and sockets


# 4a476efa 21-Nov-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Protect p_wchan with sched_lock in selwakeup().


# 279d7226 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>


# b31ae1ad 28-Jul-2000 Peter Wemm <peter@FreeBSD.org>

Fix a warning that has been annoying me for some time:
"kern/sys_generic.c:358: warning: cast discards qualifiers from pointer
target type"
The idea for using the uintptr_t intermediate cast for de-constifying
a pointer was hinted at by bde some time ago.


# 3c89e357 26-Jul-2000 Brian Feldman <green@FreeBSD.org>

Distinguish between whether ktraceing was enabled before an IO
operation or after it. If the ktrace operation was enabled while the
process was blocked doing IO, the race would allow it to pass down
invalid (uninitialized) data and panic later down the call stack.


# 9c386f6b 12-Jul-2000 John Baldwin <jhb@FreeBSD.org>

For infinite timeouts, set both the tv_sec and tv_usec fields to zero in
poll() and select().

Noticed by: Wesley Morgan <morganw@chemicals.tacorp.com>


# 4da144c0 12-Jul-2000 John Baldwin <jhb@FreeBSD.org>

Fix a very obscure bug in select() and poll() where the timeout would
never expire if poll() or select() was called before the system had been
in multiuser for 1 second. This was caused by only checking to see if
tv_sec was zero rather than checking both tv_sec and tv_usec.


# 7ceba2d7 07-Jul-2000 Brian Feldman <green@FreeBSD.org>

Remove two micro-pessimizations I made. Bruce is teaching me well :)
KTRPOINT(p, KTR_GENIO) is more uncommon than error == 0, so it should
be first in the && statement.


# 42ebfbf2 02-Jul-2000 Brian Feldman <green@FreeBSD.org>

Modify ktrace's general I/O tracing, ktrgenio(), to use a struct uio *
instead of a struct iovec * array and int len. Get rid of stupidly trying
to allocate all of the memory and copyin()ing the entire iovec[], and
instead just do the proper VOP_WRITE() in ktrwrite() using a copy of
the struct uio that the syscall originally used.

This solves the DoS which could easily be performed; to work around the
DoS, one could also remove "options KTRACE" from the kernel. This is
a very strong MFC candidate for 4.1.

Found by: art@OpenBSD.org


# 8757e5bb 12-Jun-2000 Alfred Perlstein <alfred@FreeBSD.org>

unstatic getfp() so that other subsystems can use it.

make sendfile() use it.

Approved by: dg


# d2ba455c 09-May-2000 Matthew Dillon <dillon@FreeBSD.org>

Some ioctl routines assume that the ioctl buffer is aligned, but a
char[] declaration makes no such guarentee. A union is used to force
alignment of the char buffer.


# f082218c 20-Feb-2000 Peter Wemm <peter@FreeBSD.org>

Fix select(2) for the Alpha. (!!) It was never returning true for
fd's in the range of 32-63, 96-127 etc. The first problem was the
FD_*() macros were shifting a 32 bit integer "1" left by more than
32 bits. The same problem happened in selscan(). ffs() also takes
an int argument and causes failure. For cases where int == long
(ie: the usual case for x86, but not always as gcc can have long
being a 64 bit quantity) ffs() could be used.

Reported by: Marian Stagarescu <marian@bile.skycache.com>
Reviewed by: dfr, gallatin (sys/types.h only)
Approved by: jkh


# bfbbc4aa 13-Jan-2000 Jason Evans <jasone@FreeBSD.org>

Add aio_waitcomplete(). Make aio work correctly for socket descriptors.
Make gratuitous style(9) fixes (me, not the submitter) to make the aio
code more readable.

PR: kern/12053
Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>


# 8cb96f20 05-Jan-2000 Peter Wemm <peter@FreeBSD.org>

Export the nselcoll counter via the kern.nselcoll sysctl so we can see
just how bad it gets in various situations.

Reminded by: adrian


# d8177437 14-Oct-1999 Brian Feldman <green@FreeBSD.org>

Missed the second argument of fdrop().

Submitted by: jhay


# 1aa3e7dd 13-Oct-1999 Brian Feldman <green@FreeBSD.org>

Fix a race condition with shared fd tables and writev(). It's
still not safe to consider file table sharing secure.
Submitted by: Ville-Pertti Keinonen <will@iki.fi>


# d1f088da 11-Oct-1999 Peter Wemm <peter@FreeBSD.org>

Trim unused options (or #ifdef for undoc options).

Submitted by: phk


# 13ccadd4 19-Sep-1999 Brian Feldman <green@FreeBSD.org>

This is what was "fdfix2.patch," a fix for fd sharing. It's pretty
far-reaching in fd-land, so you'll want to consult the code for
changes. The biggest change is that now, you don't use
fp->f_ops->fo_foo(fp, bar)
but instead
fo_foo(fp, bar),
which increments and decrements the fp refcount upon entry and exit.
Two new calls, fhold() and fdrop(), are provided. Each does what it
seems like it should, and if fdrop() brings the refcount to zero, the
fd is freed as well.

Thanks to peter ("to hell with it, it looks ok to me.") for his review.
Thanks to msmith for keeping me from putting locks everywhere :)

Reviewed by: peter


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 8fe387ab 04-Apr-1999 Dmitrij Tejblum <dt@FreeBSD.org>

Add standard padding argument to pread and pwrite syscall. That should make them
NetBSD compatible.

Add parameter to fo_read and fo_write. (The only flag FOF_OFFSET mean that
the offset is set in the struct uio).

Factor out some common code from read/pread/write/pwrite syscalls.


# 4160ccd9 27-Mar-1999 Alan Cox <alc@FreeBSD.org>

Added pread and pwrite. These functions are defined by the X/Open
Threads Extension. (Note: We use the same syscall numbers as NetBSD.)

Submitted by: John Plevyak <jplevyak@inktomi.com>


# 425c50cf 29-Jan-1999 Bruce Evans <bde@FreeBSD.org>

Removed a bogus cast to c_caddr_t. This is part of terminating
c_caddr_t with extreme prejudice. Here the point of the original
cast to caddr_t was to break the warning about the const mismatch
between write(2)'s `const void *buf' and `struct uio's `char
*iov_base' (previous bitrot gave a gratuitous dependency on caddr_t
being char *). Compiling with -Wcast-qual made the cast a full
no-op.

This change has no effect on the warning for discarding `const'
on assignment to iov_base. The warning should not be fixed by
splitting `struct iovec' into a non-const version for read()
and a const version for write(), since correct const poisoning
would affect all pointers to i/o addresses. Const'ness should
probably be forgotten by not declaring it in syscalls.master.


# d254af07 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 337c9691 09-Dec-1998 Jordan K. Hubbard <jkh@FreeBSD.org>

poll(2) sets POLLNVAL for descriptors passed in that are less than
0. This makes it difficult to do efficient manipulation of the
struct pollfd since you can't leave a slot empty.

PR: 8599
Submitted-by: Marc Slemko <marcs@znep.com>


# 831d27a9 11-Nov-1998 Don Lewis <truckman@FreeBSD.org>

Installed the second patch attached to kern/7899 with some changes suggested
by bde, a few other tweaks to get the patch to apply cleanly again and
some improvements to the comments.

This change closes some fairly minor security holes associated with
F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN
had on tty devices. For more details, see the description on the PR.

Because this patch increases the size of the proc and pgrp structures,
it is necessary to re-install the includes and recompile libkvm,
the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w.

PR: kern/7899
Reviewed by: bde, elvind


# 134e06fe 05-Sep-1998 Bruce Evans <bde@FreeBSD.org>

Fixed bogotification of pseudocode for syscall args by rev.1.53 of
syscalls.master.


# 069e9bc1 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptable.

Reviewed by: Bruce Evans <bde@zeta.org.au>


# 831b9ef2 10-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

64bit fixes: use u_long not int for ioctl command.


# c21410e1 17-May-1998 Poul-Henning Kamp <phk@FreeBSD.org>

s/nanoruntime/nanouptime/g
s/microruntime/microuptime/g

Reviewed by: bde


# 80a39463 05-Apr-1998 Andrey A. Chernov <ache@FreeBSD.org>

Remove unused atv.tv_usec = 0; from select/poll code


# 00af9731 04-Apr-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Time changes mark 2:

* Figure out UTC relative to boottime. Four new functions provide
time relative to boottime.

* move "runtime" into struct proc. This helps fix the calcru()
problem in SMP.

* kill mono_time.

* add timespec{add|sub|cmp} macros to time.h. (XXX: These may change!)

* nanosleep, select & poll takes long sleeps one day at a time

Reviewed by: bde
Tested by: ache and others


# 4ff16568 02-Apr-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Try to fix poll & select after I broke them.


# 227ee8a1 30-Mar-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


# 2087c896 23-Nov-1997 Bruce Evans <bde@FreeBSD.org>

Fixed some style bugs in the poll() code.

Removed dead code to "Avoid inadvertently sleeping forever". hzto()
never returns 0.


# cb226aaa 06-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
recompiled.


# a1c995b6 12-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 55166637 11-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Distribute and statizice a lot of the malloc M_* types.

Substantial input from: bde


# 42d11757 13-Sep-1997 Peter Wemm <peter@FreeBSD.org>

Implement poll(2). This is mostly taken from the NetBSD implementation
(from some time ago) but with a few tweaks along the way.

Obtained from: NetBSD


# e4ba6a82 02-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 2c1011f7 15-Jun-1997 John Dyson <dyson@FreeBSD.org>

Modifications to existing files to support the initial AIO/LIO and
kernel based threading support.


# 20982410 24-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Don't include <sys/ioctl.h> in the kernel. Stage 4: include
<sys/ttycom.h> and sometimes <sys/filio.h> instead of <sys/ioctl.h>
in miscellaneous files. Most of these files have nothing to do
with ttys but need to include <sys/ttycom.h> to get the definitions
of TIOC[SG]PGRP which are (ab)used to convert F[SG]ETOWN fcntls into
ioctls.


# 3ac4d1ef 22-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined.
Fixed everything that depended on getting fcntl.h stuff from the wrong
place. Most things don't depend on file.h stuff at all.


# 774fce94 22-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Removed `volatile' from declaration of `time', and removed the resulting
null casts. `time' is nonvolatile for accesses within a region locked
by splclock()/splx(). Accesses outside such a region are invalid, and
splx() must have the side effect of potentially changing all global
variables (since there are hundreds of sort of volatile variables like
`time'), so declaring `time' as volatile didn't have any real benefits.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# d5e4d7e1 20-Feb-1997 Bruce Evans <bde@FreeBSD.org>

Improved select():
- avoid malloc() if the number of fds is small.
- pack the bits better so that `small' is quite large.
- don't waste time generating zero bits for null fd_set pointers or
scanning these bits.

Possibly improved select():
- free malloc()ed storage before returning. This is simpler and I
think huge select()s aren't worth optimizing since they are rare,
relative gain would be small and there would be tiny costs for all
selects().

Reviewed by: ache (first version by him too)


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# acbfbfea 20-Aug-1996 Sujal Patel <smpatel@FreeBSD.org>

Fix a minor style error in my code.


# b08f7993 20-Aug-1996 Sujal Patel <smpatel@FreeBSD.org>

Remove the kernel FD_SETSIZE limit for select().
Make select()'s first argument 'int' not 'u_int'.

Reviewed by: bde


# edbfedac 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.
[note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]


# db6a20e2 03-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Converted two options over to the new scheme: USER_LDT and KTRACE.


# 87b6de2b 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

A Major staticize sweep. Generates a couple of warnings that I'll deal
with later.
A number of unused vars removed.
A number of unused procs removed or #ifdefed.


# d2d3e875 11-Nov-1995 Bruce Evans <bde@FreeBSD.org>

Included <sys/sysproto.h> to get central declarations for syscall args
structs and prototypes for syscalls.

Ifdefed duplicated decentralized declarations of args structs. It's
convenient to have this visible but they are hard to maintain. Some
are already different from the central declarations. 4.4lite2 puts
them in comments in the function headers but I wanted to avoid the
large changes for that.


# 7147b19d 10-Nov-1995 Bruce Evans <bde@FreeBSD.org>

Fixed the type of readv(). An args struct member name conflicted with the
machine-generated one in <sys/sysproto.h>.


# c3e5be20 10-Oct-1995 Steven Wallace <swallace@FreeBSD.org>

Remove the ugly COMPAT_IBCS2 hack to hide a return value through
magic numbers. The new socksys support does not need this hack.

I am against any magic practicing.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# f153fb6e 13-Apr-1995 David Greenman <dg@FreeBSD.org>

Backed out previous change - it reduces performance. (oops).


# cf0ec51a 13-Apr-1995 David Greenman <dg@FreeBSD.org>

Slight optimization to select().


# 2b101991 13-Oct-1994 Søren Schmidt <sos@FreeBSD.org>

Damn, check in the wrong version, fixed.
Reviewed by:
Submitted by:
Obtained from:


# 8117efaa 13-Oct-1994 Søren Schmidt <sos@FreeBSD.org>

Made it possible for ioctl to return a value.
Ifdef by COMPAT_IBCS2 (used by the socksys system).
Submitted by: Mostyn Lewis (mostyn@mrl.com)


# d93f860c 09-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics. related to getting prototypes into view.


# 797f2d22 02-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

All of this is cosmetic. prototypes, #includes, printfs and so on. Makes
GCC a lot more silent.


# bb56ec4a 25-Sep-1994 Poul-Henning Kamp <phk@FreeBSD.org>

While in the real world, I had a bad case of being swapped out for a lot of
cycles. While waiting there I added a lot of the extra ()'s I have, (I have
never used LISP to any extent). So I compiled the kernel with -Wall and
shut up a lot of "suggest you add ()'s", removed a bunch of unused var's
and added a couple of declarations here and there. Having a lap-top is
highly recommended. My kernel still runs, yell at me if you kernel breaks.


# 90324b07 02-Sep-1994 David Greenman <dg@FreeBSD.org>

Whoops, accidently left out some pieces of the munmapfd patch.


# dd968eae 02-Sep-1994 David Greenman <dg@FreeBSD.org>

Make sure that uio_resid isn't negative in read().


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources