History log of /freebsd-current/sys/kern/vfs_aio.c
Revision Date Author Comments
# e4b7bbd6 13-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

lio_listio(2): add LIO_FOFFSET flag to ignore aiocb aio_offset

and use the current file offset instead.

Requested by: Vinícius dos Santos Oliveira <vini.ipsmaker@gmail.com>
Reviewed by: jhb
Discussed with: asomers
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43448


# 61cc4830 18-Jan-2024 Alfredo Mazzinghi <am2419@cl.cam.ac.uk>

Abstract UIO allocation and deallocation.

Introduce the allocuio() and freeuio() functions to allocate and
deallocate struct uio. This hides the actual allocator interface, so it
is easier to modify the sub-allocation layout of struct uio and the
corresponding iovec array.

Obtained from: CheriBSD
Reviewed by: kib, markj
MFC after: 2 weeks
Sponsored by: CHaOS, EPSRC grant EP/V000292/1
Differential Revision: https://reviews.freebsd.org/D43711


# b068bb09 07-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

Add vnode_pager_clean_{a,}sync(9)

Bump __FreeBSD_version for ZFS use.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43356


# fdafd315 24-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Automated cleanup of cdefs and other formatting

Apply the following automated changes to try to eliminate
no-longer-needed sys/cdefs.h includes as well as now-empty
blank lines in a row.

Remove /^#if.*\n#endif.*\n#include\s+<sys/cdefs.h>.*\n/
Remove /\n+#include\s+<sys/cdefs.h>.*\n+#if.*\n#endif.*\n+/
Remove /\n+#if.*\n#endif.*\n+/
Remove /^#if.*\n#endif.*\n/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/types.h>/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/param.h>/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/capsicum.h>/

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 98844e99 15-Feb-2023 John Baldwin <jhb@FreeBSD.org>

aio: Fix more synchronization issues in aio_biowakeup.

- Use atomic_store to set job->error. atomic_set does an or
operation, not assignment.

- Use refcount_* to manage job->nbio.

This ensures proper memory barriers are present so that the last bio
won't see a possibly stale value of job->error.

- Don't re-read job->error after reading it via atomic_load.

Reported by: markj (1)
Reviewed by: mjg, markj
Differential Revision: https://reviews.freebsd.org/D38611


# cca6d616 15-Feb-2023 John Baldwin <jhb@FreeBSD.org>

aio_biowakeup: Various style fixes.


# 40734fc5 15-Feb-2023 Keith Reynolds <keith.reynolds@hpe.com>

aio: Fix a test and set race in aio_biowakeup.

Use atomic_fetchadd in place of separate atomic_subtract / atomic_load.

Reviewed by: markj
Sponsored by: HPE TidalScale
Differential Revision: https://reviews.freebsd.org/D38559


# a75d1ddd 17-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce V_PCATCH to stop abusing PCATCH


# 9553bc89 19-Jun-2022 Mark Johnston <markj@FreeBSD.org>

aio: Improve UMA usage

- Remove the AIO proc zone. This zone gets one allocation per AIO
daemon process, which isn't enough to warrant a dedicated zone. Plus,
unlike other AIO structures, aiops are small (32 bytes with LP64), so
UMA doesn't provide better space efficiency than malloc(9). Change
one of the malloc types in vfs_aio.c to make it more general.

- Don't set the NOFREE flag on the other AIO zones. This flag means
that memory allocated to the AIO subsystem is never freed back to the
VM, so it's always preferable to avoid using it when possible. NOFREE
was set without explanation when AIO was converted to use UMA 20 years
ago, but it does not appear to be required; all of the structures
allocated from UMA (per-process kaioinfo, kaiocb, and aioliojob) keep
track of references and get freed only when none exist. Plus, these
structures will contain dangling pointer after they're freed (e.g.,
the "cred", "fd_file" and "uiop" fields of struct kaiocb), so
use-after-frees are dangerous even when the structures themselves are
type-stable.

Reviewed by: asomers
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35493


# 31d1b816 28-May-2022 Dmitry Chagin <dchagin@FreeBSD.org>

sysent: Get rid of bogus sys/sysent.h include.

Where appropriate hide sysent.h under proper condition.

MFC after: 2 weeks


# e9c7ec22 14-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

aio: whack "set but not used" warnings


# 45c2c7c4 23-Sep-2021 Konstantin Belousov <kib@FreeBSD.org>

aio_aqueue(): avoid ucred leak on failure path

PR: 258698
Submitted by: sigsys@gmail.com
MFC after: 1 week


# 2933a7ca 19-Sep-2021 Konstantin Belousov <kib@FreeBSD.org>

aio_fsync_vnode: handle ERELOOKUP after VOP_FSYNC()

Reported by: tmunro
Reviewed by: jhb, tmunro
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32023


# 922bee44 19-Sep-2021 Konstantin Belousov <kib@FreeBSD.org>

aio_fsync_vnode: use for(;;) loop instead of label

Reviewed by: jhb, tmunro
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32023


# 2884918c 10-Sep-2021 Mark Johnston <markj@FreeBSD.org>

aio: Fix up the opcode in aiocb32_copyin()

With lio_listio(2), the opcode is specified by userspace rather than
being hard-coded by the system call (e.g., aio_readv() -> LIO_READV).
kern_lio_listio() calls aio_aqueue() with an opcode of LIO_NOP, which
gets fixed up when the aiocb is copied in.

When copying in a job request for vectored I/O, we need to dynamically
allocate a uio to wrap an iovec. So aiocb_copyin() needs to get the
opcode from the aiocb and then decide whether an allocation is required.
We failed to do this in the COMPAT_FREEBSD32 case. Fix it.

Reported by: syzbot+27eab6f2c2162f2885ee@syzkaller.appspotmail.com
Reviewed by: kib, asomers
Fixes: f30a1ae8d529 ("lio_listio(2): Allow LIO_READV and LIO_WRITEV.")
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31914


# f30a1ae8 22-Aug-2021 Thomas Munro <tmunro@FreeBSD.org>

lio_listio(2): Allow LIO_READV and LIO_WRITEV.

Allow multiple vector IOs to be started with one system call.
aio_readv() and aio_writev() already used these opcodes under the
covers. This commit makes them available to user space.

Being non-standard extensions, they're only visible if __BSD_VISIBLE is
defined, like the functions.

Reviewed by: asomers, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31627


# 2e5f6152 15-Jul-2021 Mark Johnston <markj@FreeBSD.org>

lio_listio: Don't post a completion notification if none was requested

One is allowed to use LIO_NOWAIT without specifying a sigevent. In this
case, lj->lioj_signal is left uninitialized, but several code paths
examine liov_signal.sigev_notify to figure out which notification to
post. Unconditionally initialize that field to SIGEV_NONE.

Add a dumb test case which triggers the bug.

Reported by: KMSAN+syzkaller
Reviewed by: asomers
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31197


# 8d9ed174 17-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

open(2): Implement O_PATH

Reviewed by: markj
Tested by: pho
Discussed with: walker.aj325_gmail.com, wulf
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29323


# 2247f489 02-Jan-2021 Alan Somers <asomers@FreeBSD.org>

aio: micro-optimize the lio_opcode assignments

This allows slightly more efficient opcode testing in-kernel. It is
transparent to userland, except to applications that sneakily submit
aio fsync or aio mlock operations via lio_listio, which has never been
documented, requires the use of deliberately undefined constants
(LIO_SYNC and LIO_MLOCK), and is arguably a bug.

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D27942


# ff1a3078 09-Jan-2021 Alan Somers <asomers@FreeBSD.org>

lio_listio: validate aio_lio_opcode

Previously, we would accept any kind of LIO_* opcode, including ones
that were intended for in-kernel use only like LIO_SYNC (which is not
defined in userland). The situation became more serious with
022ca2fc7fe08d51f33a1d23a9be49e6d132914e. After that revision, setting
aio_lio_opcode to LIO_WRITEV or LIO_READV would trigger an assertion.

Note that POSIX does not specify what should happen if aio_lio_opcode is
invalid.

MFC-with: 022ca2fc7fe08d51f33a1d23a9be49e6d132914e
Reviewed by: jhb, tmunro, 0mp
Differential Revision: <https://reviews.freebsd.org/D28078


# 801ac943 07-Jan-2021 Thomas Munro <tmunro@FreeBSD.org>

aio_fsync(2): Support O_DSYNC.

aio_fsync(O_DSYNC, ...) is the asynchronous version of fdatasync(2).

Reviewed by: kib, asomers, jhb
Differential Review: https://reviews.freebsd.org/D25071


# 022ca2fc 02-Jan-2021 Alan Somers <asomers@FreeBSD.org>

Add aio_writev and aio_readv

POSIX AIO is great, but it lacks vectored I/O functions. This commit
fixes that shortcoming by adding aio_writev and aio_readv. They aren't
part of the standard, but they're an obvious extension. They work just
like their synchronous equivalents pwritev and preadv.

It isn't yet possible to use vectored aiocbs with lio_listio, but that
could be added in the future.

Reviewed by: jhb, kib, bcr
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D27743


# 01206038 21-Dec-2020 Alan Somers <asomers@FreeBSD.org>

AIO: remove the kaiocb->bio linkage

Vectored aio will require each aiocb to be associated with multiple
bios, so we can't store a link to the latter from the former. But we
don't really need to. aio_biowakeup already knows the bio it's using,
and the other fields can be stored within the bio and/or buf itself.

Also, remove the unused kaiocb.backend2 field.

Reviewed By: kib
Differential Revision: https://reviews.freebsd.org/D27682


# 6814c2da 01-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

lio_listio(2): send signal even if number of jobs is zero.

Right now, if lio registered zero jobs, syscall frees lio job
structure, cleaning up queued ksi. As result, the realtime signal is
dequeued and never delivered.

Fix it by allowing sendsig() to copy ksi when job count is zero.

PR: 220398
Reported and reviewed by: asomers
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27421


# 29331656 01-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

vfs_aio.c: style.

Mostly re-wrap conditions to split after binary ops.

Reviewed by: asomers
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27421


# 5c5005ec 01-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

vfs_aio.c: correct comment.

Reviewed by: asomers
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27421


# a9d4fe97 29-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

bio aio: Destroy ephemeral mapping before unwiring page.

Apparently some architectures, like ppc in its hashed page tables
variants, account mappings by pmap_qenter() in the response from
pmap_is_page_mapped().

While there, eliminate useless userp variable.

Noted and reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27409


# cd853791 27-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Make MAXPHYS tunable. Bump MAXPHYS to 1M.

Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225


# f7db0c95 04-Nov-2020 Mark Johnston <markj@FreeBSD.org>

vmspace: Convert to refcount(9)

This is mostly mechanical except for vmspace_exit(). There, use the new
refcount_release_if_last() to avoid switching to vmspace0 unless other
processes are sharing the vmspace. In that case, upon switching to
vmspace0 we can unconditionally release the reference.

Remove the volatile qualifier from vm_refcnt now that accesses are
protected using refcount(9) KPIs.

Reviewed by: alc, kib, mmel
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27057


# 8128c65b 30-Sep-2020 John Baldwin <jhb@FreeBSD.org>

Avoid a dubious assignment to bio_data in aio_qbio().

A user pointer is not a suitable value for bio_data and the next block
of code always overwrites bio_data anyway. Just use cb->aio_buf
directly in the call to vm_fault_quick_hold_pages().

Reviewed by: kib
Obtained from: CheriBSD
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26595


# 7ad2a82d 18-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error

Most consumers pass NULL.


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# 849aef49 21-Nov-2019 Andrew Turner <andrew@FreeBSD.org>

Port the NetBSD KCSAN runtime to FreeBSD.

Update the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime to work in
the FreeBSD kernel. It is a useful tool for finding data races between
threads executing on different CPUs.

This can be enabled by enabling KCSAN in the kernel config, or by using the
GENERIC-KCSAN amd64 kernel. It works on amd64 and arm64, however the later
needs a compiler change to allow -fsanitize=thread that KCSAN uses.

Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22315


# 756a5412 14-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.

o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.

Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.

The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with: gallatin
Tested by: pho


# 72bce9ff 26-Nov-2018 Alan Somers <asomers@FreeBSD.org>

vfs_aio.c: rename "physio" symbols to "bio".

aio has two paths: an asynchronous "physio" path and a synchronous path.
Confusingly, physio(9) isn't actually used by the "physio" path, and never
has been. In fact, it may even be called by the synchronous path! Rename
the "physio" path to the "bio" path to reflect what it actually does:
directly compose BIOs and send them to character devices.

MFC after: 2 weeks


# 792843c3 24-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Pass malloc flags directly through kevent(2) subroutines.

Some kevent functions have a boolean "waitok" parameter for use when
calling malloc(9). Replace them with the corresponding malloc() flags:
the desired behaviour is known at compile-time, so this eliminates a
couple of conditional branches, and makes the code easier to read.

No functional change intended.

Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18318


# 36c4960e 24-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Plug some kernel memory disclosures via kevent(2).

The kernel may register for events on behalf of a userspace process,
in which case it must be careful to zero the kevent struct that will be
copied out to userspace.

Reviewed by: kib
MFC after: 3 days
Security: kernel stack memory disclosure
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18317


# cbd92ce6 09-May-2018 Matt Macy <mmacy@FreeBSD.org>

Eliminate the overhead of gratuitous repeated reinitialization of cap_rights

- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.

Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month


# 52c09831 16-Apr-2018 Alan Somers <asomers@FreeBSD.org>

lio_listio: return EAGAIN instead of EIO when out of resources

This behavior is already documented by the man page, and suggested by POSIX.

Reviewed by: jhb
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D15099


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# 86bbef43 10-Jan-2018 John Baldwin <jhb@FreeBSD.org>

Don't store shadow copies of per-process AIO limits.

Previously the AIO subsystem would save a snapshot of the currently
configured per-process limits the first time a process used AIO. The
process would continue to use the snapshotted limits ignoring any
changes to the global limits during the rest of its lifetime. This
change removes the snapshotted values and changes the AIO code to
always check the global values which can be toggled at runtime.
This means an administrator can now change the effective limits of
existing processes. This is more consistent with how other limits
configured via sysctl work in FreeBSD.

Reviewed by: asomers, kib
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D13819


# f54c5606 09-Jan-2018 John Baldwin <jhb@FreeBSD.org>

Allow the fast-path for disk AIO requests to fail requests.

- If aio_qphysio() returns a non-zero error code, fail the request rather
than queueing it to the AIO kproc pool to be retried via the slow path.
Currently this means that if vm_fault_quick_hold_pages() reports an
error, EFAULT is returned from the fast-path rather than retrying the
request in the slow path where it will still fail with EFAULT.
- If aio_qphysio() wishes to use the fast path for a device that doesn't
support unmapped I/O but there are already the maximum number of
such requests in flight, fail with EAGAIN as we do for other AIO
resource limits rather than queueing the request to the AIO kproc pool.
- Move the opcode check for aio_qphysio() out of the caller and into
aio_qphysio() to simplify some logic and remove two goto's while here.
It also uses a whitelist (only supported for LIO_READ / LIO_WRITE)
rather than a blacklist (skipped for LIO_SYNC).

PR: 217261
Submitted by: jkim (an earlier version)
MFC after: 2 weeks
Sponsored by: Chelsio Communications


# 7e409184 09-Jan-2018 John Baldwin <jhb@FreeBSD.org>

Simplify some logic by merging an if test with a subsequent switch.

Specifically, in aio_queue_file() the code was doing this:

if (opcode == LIO_SYNC) {
...
}

switch (opcode) {
...
case LIO_SYNC:
...
}

This moves the body of the if statement into the LIO_SYNC case of the
switch statement.

MFC after: 2 weeks
Sponsored by: Chelsio Communications


# 8091e52b 09-Jan-2018 John Baldwin <jhb@FreeBSD.org>

Add a counter to track in-flight AIO requests using unmapped I/O.

MFC after: 2 weeks
Sponsored by: Chelsio Communications


# 151ba793 24-Dec-2017 Alexander Kabaev <kan@FreeBSD.org>

Do pass removing some write-only variables from the kernel.

This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385


# 8a36da99 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.


# df485bdb 26-Oct-2017 Alan Somers <asomers@FreeBSD.org>

Fix aio_suspend in 32-bit emulation

An off-by-one error has been present since the system call was first present
in 185878. It additionally became a memory corruption bug after change
324941. The failure is actually revealed by our existing AIO tests.
However, apparently nobody's been running those in 32-bit emulation mode.

Reported by: Coverity, cem
CID: 1382114
MFC after: 18 days
X-MFC-With: 324941
Sponsored by: Spectra Logic Corp


# 913b9329 23-Oct-2017 Alan Somers <asomers@FreeBSD.org>

Remove artificial restriction on lio_listio's operation count

In r322258 I made p1003_1b.aio_listio_max a tunable. However, further
investigation shows that there was never any good reason for that limit to
exist in the first place. It's used in two completely different ways:

* To size a UMA zone, which globally limits the number of concurrent
aio_suspend calls.

* To artifically limit the number of operations in a single lio_listio call.
There doesn't seem to be any memory allocation associated with this limit.

This change does two things:

* Properly names aio_suspend's UMA zone, and sizes it based on a new constant.

* Eliminates the artifical restriction on lio_listio. Instead, lio_listio
calls will now be limited by the more generous max_aio_queue_per_proc. The
old p1003_1b.aio_listio_max is now an alias for
vfs.aio.max_aio_queue_per_proc, so sysconf(3) will still work with
_SC_AIO_LISTIO_MAX.

Reported by: bde
Reviewed by: jhb
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D12120


# c45796d5 08-Aug-2017 Alan Somers <asomers@FreeBSD.org>

Make p1003_1b.aio_listio_max a tunable

p1003_1b.aio_listio_max is now a tunable. Its value is reflected in the
sysctl of the same name, and the sysconf(3) variable _SC_AIO_LISTIO_MAX.
Its value will be bounded from below by the compile-time constant
AIO_LISTIO_MAX and from above by the compile-time constant
MAX_AIO_QUEUE_PER_PROC and the tunable vfs.aio.max_aio_queue.

Reviewed by: jhb, kib
MFC after: 3 weeks
Relnotes: yes
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D11601


# 711dba24 19-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Allow negative aio_offset only for the read and write LIO ops on
device nodes.

Otherwise, the current check of aio_offset == -1LL makes it possible
to pass negative file offsets down to the filesystems. This trips
assertions and is even unsafe for e.g. FFS which keeps metadata at
negative offsets.

Reported and tested by: pho
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11266


# 2b34e843 16-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Add abstime kqueue(2) timers and expand struct kevent members.

This change implements NOTE_ABSTIME flag for EVFILT_TIMER, which
specifies that the data field contains absolute time to fire the
event.

To make this useful, data member of the struct kevent must be extended
to 64bit. Using the opportunity, I also added ext members. This
changes struct kevent almost to Apple struct kevent64, except I did
not changed type of ident and udata, the later would cause serious API
incompatibilities.

The type of ident was kept uintptr_t since EVFILT_AIO returns a
pointer in this field, and e.g. CHERI is sensitive to the type
(discussed with brooks, jhb).

Unlike Apple kevent64, symbol versioning allows us to claim ABI
compatibility and still name the new syscall kevent(2). Compat shims
are provided for both host native and compat32.

Requested by: bapt
Reviewed by: bapt, brooks, ngie (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D11025


# 496ab053 13-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Rework r313352.

Rename kern_vm_* functions to kern_*. Move the prototypes to
syscallsubr.h. Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.

Requested by: alc, jhb
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D9535


# e2a18110 17-Aug-2016 Konstantin Belousov <kib@FreeBSD.org>

Remove duplicated code.

aio_aqueue() calls aio_init_aioinfo() as the first action. There is no
need to duplicate the code in kern_aio_fsync().

Also fix indent for aio_aqueue() definition.

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D7523


# 005ce8e4 29-Jul-2016 John Baldwin <jhb@FreeBSD.org>

Fix locking issues with aio_fsync().

- Use correct lock in aio_cancel_sync when dequeueing job.
- Add _locked variants of aio_set/clear_cancel_function and use those
to avoid lock recursion when adding and removing fsync jobs to the
per-process sync queue.
- While here, add a basic test for aio_fsync().

PR: 211390
Reported by: Randy Westlund <rwestlun@gmail.com>
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D7339


# b9a53e16 27-Jul-2016 John Baldwin <jhb@FreeBSD.org>

Adjust tests in fsync job scheduling loop to reduce indentation.


# 9c20dc99 21-Jul-2016 John Baldwin <jhb@FreeBSD.org>

Add more documentation regarding unsafe AIO requests.

The asynchronous I/O changes made previously result in different
behavior out of the box. Previously all AIO requests failed with
ENOSYS / SIGSYS unless aio.ko was explicitly loaded. Now, some AIO
requests complete and others ("unsafe" requests) fail with EOPNOTSUPP.

Reword the introductory paragraph in aio(4) to add a general
description of AIO before describing the vfs.aio.enable_unsafe sysctl.

Remove the ENOSYS error description from aio_fsync(2), aio_read(2),
and aio_write(2) and replace it with a description of EOPNOTSUPP.

Remove the ENOSYS error description from aio_mlock(2).

Log a message to the system log the first time a process requests an
"unsafe" AIO request that fails with EOPNOTSUPP. This is modeled on
the log message used for processes using the legacy pty devices.

Reviewed by: kib (earlier version)
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D7151


# 9fe297bb 21-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

Declare aio requests on files from local filesystems safe.
Two notes:
- I allow AIO on reclaimed vnodes, since it is deterministically terminated
fast.
- devfs mounts are marked as MNT_LOCAL, but device vnodes have type
VCHR, so the slow device io is not allowed.

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D7273


# b1012d80 21-Jun-2016 John Baldwin <jhb@FreeBSD.org>

Account for AIO socket operations in thread/process resource usage.

File and disk-backed I/O requests store counts of read/written disk
blocks in each AIO job so that they can be charged to the thread that
completes an AIO request via aio_return() or aio_waitcomplete(). This
change extends AIO jobs to store counts of received/sent messages and
updates socket backends to set these counts accordingly. Note that
the socket backends are careful to only charge a single messages for
each AIO request even though a single request on a blocking socket might
invoke sosend or soreceive multiple times. This is to mimic the
resource accounting of synchronous read/write.

Adjust the UNIX socketpair AIO test to verify that the message resource
usage counts update accordingly for aio_read and aio_write.

Approved by: re (hrs)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D6911


# fe0bdd1d 15-Jun-2016 John Baldwin <jhb@FreeBSD.org>

Move backend-specific fields of kaiocb into a union.

This reduces the size of kaiocb slightly. I've also added some generic
fields that other backends can use in place of the BIO-specific fields.

Change the socket and Chelsio DDP backends to use 'backend3' instead of
abusing _aiocb_private.status directly. This confines the use of
_aiocb_private to the AIO internals in vfs_aio.c.

Reviewed by: kib (earlier version)
Approved by: re (gjb)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D6547


# f0ec1740 20-May-2016 John Baldwin <jhb@FreeBSD.org>

Consistently set status to -1 when completing an AIO request with an error.

Sponsored by: Chelsio Communications


# 4d805eac 31-Mar-2016 John Baldwin <jhb@FreeBSD.org>

Tidy up the unmapped I/O code in qphysio.

- Move some blocks around to reduce the number of 'if (unmap)' checks.
- Use 'pbuf == NULL' instead of 'unmap'.
- Use nitems.
- Pull an assignment out of an if expression.

Reviewed by: kib
Sponsored by: Chelsio Communications


# bb430bc7 21-Mar-2016 John Baldwin <jhb@FreeBSD.org>

Fully handle size_t lengths in AIO requests.

First, update the return types of aio_return() and aio_waitcomplete() to
ssize_t.

POSIX requires aio_return() to return a ssize_t so that it can represent
all return values from read() and write(). aio_waitcomplete() should use
ssize_t for the same reason.

aio_return() has used ssize_t in <aio.h> since r31620 but the manpage and
system call entry were not updated. aio_waitcomplete() has always
returned int.

Note that this does not require new system call stubs as this is
effectively only an API change in how the compiler interprets the return
value.

Second, allow aio_nbytes values up to IOSIZE_MAX instead of just INT_MAX.

aio_read/write should now honor the same length limits as normal read/write.

Third, use longs instead of ints in the aio_return() and aio_waitcomplete()
system call functions so that the 64-bit size_t in the in-kernel aiocb
isn't truncated to 32-bits before being copied out to userland or
being returned.

Finally, a simple test has been added to verify the bounds checking on the
maximum read size from a file.


# 5166fdde 18-Mar-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

aio_qphysio(): Avoid uninitialized pointer read on error.

For the !unmap case it may happen that pbuf gets called unreferenced
when vm_fault_quick_hold_pages() fails.
Initialize it so it doesn't cause trouble.

CID: 1352776
Reviewed by: jhb
MFC after: 1 week


# 399e8c17 09-Mar-2016 John Baldwin <jhb@FreeBSD.org>

Simplify AIO initialization now that it is standard.

- Mark AIO system calls as STD and remove the helpers to dynamically
register them.
- Use COMPAT6 for the old system calls with the older sigevent instead of
an 'o' prefix.
- Simplify the POSIX configuration to note that AIO is always available.
- Handle AIO in the default VOP_PATHCONF instead of special casing it in
the pathconf() system call. fpathconf() is still hackish.
- Remove freebsd32_aio_cancel() as it just called the native one directly.

Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5589


# f3215338 01-Mar-2016 John Baldwin <jhb@FreeBSD.org>

Refactor the AIO subsystem to permit file-type-specific handling and
improve cancellation robustness.

Introduce a new file operation, fo_aio_queue, which is responsible for
queueing and completing an asynchronous I/O request for a given file.
The AIO subystem now exports library of routines to manipulate AIO
requests as well as the ability to run a handler function in the
"default" pool of AIO daemons to service a request.

A default implementation for file types which do not include an
fo_aio_queue method queues requests to the "default" pool invoking the
fo_read or fo_write methods as before.

The AIO subsystem permits file types to install a private "cancel"
routine when a request is queued to permit safe dequeueing and cleanup
of cancelled requests.

Sockets now use their own pool of AIO daemons and service per-socket
requests in FIFO order. Socket requests will not block indefinitely
permitting timely cancellation of all requests.

Due to the now-tight coupling of the AIO subsystem with file types,
the AIO subsystem is now a standard part of all kernels. The VFS_AIO
kernel option and aio.ko module are gone.

Many file types may block indefinitely in their fo_read or fo_write
callbacks resulting in a hung AIO daemon. This can result in hung
user processes (when processes attempt to cancel all outstanding
requests during exit) or a hung system. To protect against this, AIO
requests are only permitted for known "safe" files by default. AIO
requests for all file types can be enabled by setting the new
vfs.aio.enable_usafe sysctl to a non-zero value. The AIO tests have
been updated to skip operations on unsafe file types if the sysctl is
zero.

Currently, AIO requests on sockets and raw disks are considered safe
and are enabled by default. aio_mlock() is also enabled by default.

Reviewed by: cem, jilles
Discussed with: kib (earlier version)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5289


# 5652770d 05-Feb-2016 John Baldwin <jhb@FreeBSD.org>

Rename aiocblist to kaiocb and use consistent variable names.

Typically <foo>list is used for a structure that holds a list head in
FreeBSD, not for members of a list. As such, rename 'struct aiocblist'
to 'struct kaiocb' (the kernel version of 'struct aiocb').

While here, use more consistent variable names for AIO control blocks:

- Use 'job' instead of 'aiocbe', 'cb', 'cbe', or 'iocb' for kernel job
objects.
- Use 'jobn' instead of 'cbn' for use with TAILQ_FOREACH_SAFE().
- Use 'sjob' and 'sjobn' instead of 'scb' and 'scbn' for fsync jobs.
- Use 'ujob' instead of 'aiocbp', 'job', 'uaiocb', or 'uuaiocb' to hold
a user pointer to a 'struct aiocb'.
- Use 'ujobp' instead of 'aiocbp' for a user pointer to a 'struct aiocb *'.

Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5125


# 0dd6c035 26-Jan-2016 John Baldwin <jhb@FreeBSD.org>

Various style fixes.

- Wrap long lines.
- Fix indentation.
- Remove excessive parens.
- Whitespace fixes in struct definitions.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5025


# 39314b7d 20-Jan-2016 John Baldwin <jhb@FreeBSD.org>

AIO daemons have always been kernel processes to facilitate switching to
user VM spaces while servicing jobs. Update various comments and data
structures that refer to AIO daemons as threads to refer to processes
instead.

Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4999


# 4429f0e2 20-Jan-2016 John Baldwin <jhb@FreeBSD.org>

Remove unused variables for socket AIO.

In r55943, a per-process queue of pending socket AIO requests (requests
waiting for the socket to become ready) was added so that they could be
cancelled during process rundown. In r154765, the rundown code was
changed to handle jobs in this state (JOBST_JOBQSOCK) directly removing
the need for the extra queue. However, the per-process queue head and
global lock were never removed.

Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4997


# 8a4dc40f 19-Jan-2016 John Baldwin <jhb@FreeBSD.org>

Various cleanups to the main function for AIO kernel processes:

- Pull the vmspace logic out into helper functions and reduce duplication.
Operations on the vmspace are all isolated to vm_map.c, but it now exports
a new 'vmspace_switch_aio' for use by AIO kernel processes.
- When an AIO kernel process wants to exit, break out of the main loop and
perform cleanup after the loop end. This reduces a lot of indentation and
allows cleanup to more closely mirror setup actions before the loop starts.
- Convert a DIAGNOSTIC to KASSERT().
- Replace mycp with more typical 'p'.

Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4990


# f2e7f06a 19-Jan-2016 John Baldwin <jhb@FreeBSD.org>

Don't create a dedicated session for each AIO kernel process.

This code dates back to the initial AIO support and the commit log does
not explain why it is needed. However, I cannot find anything in the
AIO code or the various file methods (fo_read/fo_write) that would change
behavior due to using a private session instead of proc0's session.

Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4988


# 6c8fd022 14-Jan-2016 John Baldwin <jhb@FreeBSD.org>

Remove aiod_timeout.

It hasn't been used since the AIO code was made MPSAFE 10 years ago.

Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4946


# c85650ca 14-Jan-2016 John Baldwin <jhb@FreeBSD.org>

Rename aiod_bio taskqueue to aiod_kick.

This taskqueue is not used to handle bio requests. It is only used to
run aio_kick_nowait() to spin up new aio daemon processes.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D4904


# 38d68e2d 25-Oct-2015 Pawel Jakub Dawidek <pjd@FreeBSD.org>

The aio_waitcomplete(2) syscall should not sleep when the given timeout
is 0. Without this change it was sleeping for one tick. Maybe not a big
deal, but it makes share/dtrace/blocking script to report that.

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D3814
Sponsored by: Wheel Systems, http://wheelsystems.com


# 9889bbac 06-Jul-2015 Konstantin Belousov <kib@FreeBSD.org>

Mutex memory is not zeroed, add MTX_NEW.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# f131759f 05-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

fd: make 'rights' a manadatory argument to fget* functions


# f743d981 22-Apr-2015 Alexander Motin <mav@FreeBSD.org>

Make AIO to not allocate pbufs for unmapped I/O like r281825.

While there, make few more performance optimizations.

On 40-core system doing many 512-byte AIO reads from array of raw SSDs
this change removes lock congestions inside pbuf allocator and devfs,
and bottleneck on single AIO completion taskqueue thread. It improves
peak AIO performance from ~600K to ~1.3M IOPS.

MFC after: 2 weeks


# e015b1ab 26-Oct-2014 Mateusz Guzik <mjg@FreeBSD.org>

Avoid dynamic syscall overhead for statically compiled modules.

The kernel tracks syscall users so that modules can safely unregister them.

But if the module is not unloadable or was compiled into the kernel, there is
no need to do this.

Achieve this by adding SY_THR_STATIC_KLD macro which expands to SY_THR_STATIC
during kernel build and 0 otherwise.

Reviewed by: kib (previous version)
MFC after: 2 weeks


# 4a144410 16-Mar-2014 Robert Watson <rwatson@FreeBSD.org>

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after: 3 weeks


# 96a62209 05-Sep-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

The fget() function now takes pointer to cap_rights_t, so change 0 to NULL.


# 7008be5b 04-Sep-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


# ce625ec7 15-Aug-2013 Kenneth D. Merry <ken@FreeBSD.org>

Change the way that unmapped I/O capability is advertised.

The previous method was to set the D_UNMAPPED_IO flag in the cdevsw
for the driver. The problem with this is that in many cases (e.g.
sa(4)) there may be some instances of the driver that can handle
unmapped I/O and some that can't. The isp(4) driver can handle
unmapped I/O, but the esp(4) driver currently cannot. The cdevsw
is shared among all driver instances.

So instead of setting a flag on the cdevsw, set a flag on the cdev.
This allows drivers to indicate support for unmapped I/O on a
per-instance basis.

sys/conf.h: Remove the D_UNMAPPED_IO cdevsw flag and replace it
with an SI_UNMAPPED cdev flag.

kern_physio.c: Look at the cdev SI_UNMAPPED flag to determine
whether or not a particular driver can handle
unmapped I/O.

geom_dev.c: Set the SI_UNMAPPED flag for all GEOM cdevs.
Since GEOM will create a temporary mapping when
needed, setting SI_UNMAPPED unconditionally will
work.

Remove the D_UNMAPPED_IO flag.

nvme_ns.c: Set the SI_UNMAPPED flag on cdevs created here
if NVME_UNMAPPED_BIO_SUPPORT is enabled.

vfs_aio.c: In aio_qphysio(), check the SI_UNMAPPED flag on a
cdev instead of the D_UNMAPPED_IO flag on the cdevsw.

sys/param.h: Bump __FreeBSD_version to 1000045 for the switch from
setting the D_UNMAPPED_IO flag in the cdevsw to setting
SI_UNMAPPED in the cdev.

Reviewed by: kib, jimharris
MFC after: 1 week
Sponsored by: Spectra Logic


# 977c7043 02-Aug-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Remove extra zeroing after M_ZERO allocation.


# 97319989 21-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

Move the convert_sigevent32() utility function into freebsd32_misc.c
for consumption outside the vfs_aio.c.

For SIGEV_THREAD_ID and SIGEV_SIGNAL notification delivery methods,
also copy in the sigev_value, since librt event pumping loop compares
note generation number with the value passed through sigev_value.

Tested by: Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 6160e12c 08-Jun-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Add new system call - aio_mlock(). The name speaks for itself. It allows
to perform the mlock(2) operation, which can consume a lot of time, under
control of aio(4).

Reviewed by: kib, jilles
Sponsored by: Nginx, Inc.


# f95c13db 08-Jun-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Separate LIO_SYNC processing into a separate function aio_process_sync(),
and rename aio_process() into aio_process_rw().

Reviewed by: kib
Sponsored by: Nginx, Inc.


# f3215a60 27-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Fix a race with the vnode reclamation in the aio_qphysio(). Obtain
the thread reference on the vp->v_rdev and use the returned struct
cdev *dev instead of using vp->v_rdev. Call dev_strategy_csw()
instead of dev_strategy(), since we now own the reference.

Since the csw was already calculated, test d_flags to avoid mapping
the buffer if the driver supports unmapped requests [*].

Suggested by: kan [*]
Reviewed by: kan (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# e81ff91e 19-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Do not remap usermode pages into KVA for physio.

Sponsored by: The FreeBSD Foundation
Tested by: pho


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 2609222a 01-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convinient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# d56e058a 04-Feb-2012 David Xu <davidxu@FreeBSD.org>

Add 32-bit compat code for AIO kevent flags introduced in revision 230857.


# fde80935 31-Jan-2012 David Xu <davidxu@FreeBSD.org>

If multiple threads call kevent() to get AIO events on same kqueue fd,
it is possible that a single AIO event will be reported to multiple
threads, it is not threading friendly, and the existing API can not
control this behavior.
Allocate a kevent flags field sigev_notify_kevent_flags for AIO event
notification in sigevent, and allow user to pass EV_CLEAR, EV_DISPATCH
or EV_ONESHOT to AIO kernel code, user can control whether the event
should be cleared once it is retrieved by a thread. This change should
be comptaible with existing application, because the field should have
already been zero-filled, and no additional action will be taken by
kernel.

PR: kern/156567


# 8e9fc278 30-Jan-2012 Doug Ambrisko <ambrisko@FreeBSD.org>

When detaching an AIO or LIO requests grab the lock and tell knlist_remove
that we have the lock now. This cleans up a locking panic ASSERT when
knlist_empty is called without a lock when INVARIANTS etc. are turned.

Reviewed by: kib jhb
MFC after: 1 week


# 94fce847 27-Jan-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Fix size check, that prevents getting negative after casting
to a signed type

Reviewed by: bde


# 434ea137 26-Jan-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Although aio_nbytes is size_t, later is is signed to
casted types: to ssize_t in filesystem code and to
int in buf code, thus supplying a negative argument
leads to kernel panic later. To fix that check user
supplied argument in the beginning of syscall.

Submitted by: Maxim Dounin <mdounin mdounin.ru>, maxim@


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# a9d2f8d8 10-Aug-2011 Robert Watson <rwatson@FreeBSD.org>

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# cf7d9a8c 08-Oct-2010 David Xu <davidxu@FreeBSD.org>

Create a global thread hash table to speed up thread lookup, use
rwlock to protect the table. In old code, thread lookup is done with
process lock held, to find a thread, kernel has to iterate through
process and thread list, this is quite inefficient.
With this change, test shows in extreme case performance is
dramatically improved.

Earlier patch was reviewed by: jhb, julian


# 36131b48 07-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r205326:
Convert aio syscall registration to SYSCALL_INIT_HELPER.


# 4ccf64eb 06-Apr-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

MFC r205014,205015:

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

This MFC is required for MFCs of later changes to the freebsd32
compatibility from HEAD.

Requested by: kib


# 723d37c0 19-Mar-2010 Konstantin Belousov <kib@FreeBSD.org>

Convert aio syscall registration to SYSCALL_INIT_HELPER.

Reviewed by: jhb
MFC after: 2 weeks


# 841c0c7e 11-Mar-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

Reviewed by: kib, jhb


# e76d823b 12-Sep-2009 Robert Watson <rwatson@FreeBSD.org>

Use C99 initialization for struct filterops.

Obtained from: Mac OS X
Sponsored by: Apple Inc.
MFC after: 3 weeks


# d8b0556c 10-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by: jhb [1]
Noted by: jhb [2]
Reviewed by: jhb
Tested by: rnoland


# 74fb0ba7 01-Jun-2009 John Baldwin <jhb@FreeBSD.org>

Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held; however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.

Discussed with: rwatson, rmacklem


# e588eeb1 23-Jan-2009 John Baldwin <jhb@FreeBSD.org>

Use the correct type for the timeout parameter to the 32-bit
compat version aio_waitcomplete().

Reminded by: bz
Submitted by: jamie
MFC after: 3 days


# 3858a1f4 10-Dec-2008 John Baldwin <jhb@FreeBSD.org>

- Add 32-bit compat system calls for VFS_AIO. The system calls live in the
aio code and are registered via the recently added SYSCALL32_*() helpers.
- Since the aio code likes to invoke fuword and suword a lot down in the
"bowels" of system calls, add a structure holding a set of operations for
things like storing errors, copying in the aiocb structure, storing
status, etc. The 32-bit system calls use a separate operations vector to
handle fuword32 vs fuword, etc. Also, the oldsigevent handling is now
done by having seperate operation vectors with different aiocb copyin
routines.
- Split out kern_foo() functions for the various AIO system calls so the
32-bit front ends can manage things like copying in and converting
timespec structures, etc.
- For both the native and 32-bit aio_suspend() and lio_listio() calls,
just use copyin() to read the array of aiocb pointers instead of using
a for loop that iterated over fuword/fuword32. The error handling in
the old case was incomplete (lio_listio() just ignored any aiocb's that
it got an EFAULT trying to read rather than reporting an error), and
possibly slower.

MFC after: 1 month


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 22035f47 21-Jun-2008 Oleksandr Tymoshenko <gonzo@FreeBSD.org>

Use minimum of max_aio_procs and target_aio_procs when spawning new
aiod since there should be no more then max_aio_procs processes.


# e603be7a 01-Feb-2008 Robert Watson <rwatson@FreeBSD.org>

Use FEATURE() macro to advertise aio availability.


# a8afa221 24-Jan-2008 Jean-Sébastien Pédron <dumbbell@FreeBSD.org>

When asked to use kqueue, AIO stores its internal state in the
`kn_sdata' member of the newly registered knote. The problem is that
this member is overwritten by a call to kevent(2) with the EV_ADD flag,
targetted at the same kevent/knote. For instance, a userland application
may set the pointer to NULL, leading to a panic.

A testcase was provided by the submitter.

PR: kern/118911
Submitted by: MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after: 1 day


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 3745c395 20-Oct-2007 Julian Elischer <julian@FreeBSD.org>

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


# 5114048b 20-Aug-2007 Konstantin Belousov <kib@FreeBSD.org>

Destroy the kaio_mtx on the freeing the struct kaioinfo in the
aio_proc_rundown.

Do not allow for zero-length read to be passed to the fo_read file method
by aio.

Reported and tested by: Peter Holm
Approved by: re (kensmith)


# a659386c 09-Jun-2007 Matt Jacob <mjacob@FreeBSD.org>

Remove unused variable.


# 1c4bcd05 31-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to it's exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


# 873fbcd7 05-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Further system call comment cleanup:

- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.


# 6aeb05d7 11-Nov-2006 Tom Rhodes <trhodes@FreeBSD.org>

Merge posix4/* into normal kernel hierarchy.

Reviewed by: glanced at by jhb
Approved by: silence on -arch@ and -standards@


# 6a1162d4 15-Oct-2006 Alexander Leidinger <netchild@FreeBSD.org>

MFP4 (with some minor changes):

Implement the linux_io_* syscalls (AIO). They are only enabled if the native
AIO code is available (either compiled in to the kernel or as a module) at
the time the functions are used. If the AIO stuff is not available there
will be a ENOSYS.

From the submitter:
---snip---
DESIGN NOTES:

1. Linux permits a process to own multiple AIO queues (distinguished by
"context"), but FreeBSD creates only one single AIO queue per process.
My code maintains a request queue (STAILQ of queue(3)) per "context",
and throws all AIO requests of all contexts owned by a process into
the single FreeBSD per-process AIO queue.

When the process calls io_destroy(2), io_getevents(2), io_submit(2) and
io_cancel(2), my code can pick out requests owned by the specified context
from the single FreeBSD per-process AIO queue according to the per-context
request queues maintained by my code.

2. The request queue maintained by my code stores contrast information between
Linux IO control blocks (struct linux_iocb) and FreeBSD IO control blocks
(struct aiocb). FreeBSD IO control block actually exists in userland memory
space, required by FreeBSD native aio_XXXXXX(2).

3. It is quite troubling that the function io_getevents() of libaio-0.3.105
needs to use Linux-specific "struct aio_ring", which is a partial mirror
of context in user space. I would rather take the address of context in
kernel as the context ID, but the io_getevents() of libaio forces me to
take the address of the "ring" in user space as the context ID.

To my surprise, one comment line in the file "io_getevents.c" of
libaio-0.3.105 reads:

Ben will hate me for this

REFERENCE:

1. Linux kernel source code: http://www.kernel.org/pub/linux/kernel/v2.6/
(include/linux/aio_abi.h, fs/aio.c)

2. Linux manual pages: http://www.kernel.org/pub/linux/docs/manpages/
(io_setup(2), io_destroy(2), io_getevents(2), io_submit(2), io_cancel(2))

3. Linux Scalability Effort: http://lse.sourceforge.net/io/aio.html
The design notes: http://lse.sourceforge.net/io/aionotes.txt

4. The package libaio, both source and binary:
http://rpmfind.net/linux/rpm2html/search.php?query=libaio
Simple transparent interface to Linux AIO system calls.

5. Libaio-oracle: http://oss.oracle.com/projects/libaio-oracle/
POSIX AIO implementation based on Linux AIO system calls (depending on
libaio).
---snip---

Submitted by: Li, Xiao <intron@intron.ac>


# 4db71d27 23-Sep-2006 John-Mark Gurney <jmg@FreeBSD.org>

hide kqueue_register from public view, and replace it w/ kqfd_register...
this eliminates a possible race in aio registering a kevent..


# f6d004d5 06-Sep-2006 Mark Peek <mp@FreeBSD.org>

Remove call to fdfree() for the AIO daemons to prevent kernel panics
with linprocfs. This call is not needed since file descriptor sharing
was removed in v1.125.

Reviewed by: alc, davidxu, ambrisko
MFC after: 3 days


# 993182e5 14-Aug-2006 Alexander Leidinger <netchild@FreeBSD.org>

- Change process_exec function handlers prototype to include struct
image_params arg.
- Change struct image_params to include struct sysentvec pointer and
initialize it.
- Change all consumers of process_exit/process_exec eventhandlers to
new prototypes (includes splitting up into distinct exec/exit functions).
- Add eventhandler to userret.

Sponsored by: Google SoC 2006
Submitted by: rdivacky
Parts suggested by: jhb (on hackers@)


# 51e37c7f 02-Jun-2006 Doug Ambrisko <ambrisko@FreeBSD.org>

Make lio ident more consistant with aio ident.


# 759cccca 08-May-2006 David Xu <davidxu@FreeBSD.org>

Use a dedicated mutex to protect aio queues, the movation is to reduce
lock contention with other parts.


# dbbccfe9 23-Mar-2006 David Xu <davidxu@FreeBSD.org>

1. Move code for scanning pending I/O from aio_fsync to aio_aqueue,
it has less overhead.
2. Avoid scheduling task if maximum number of I/O threads is reached.


# 99eee864 23-Mar-2006 David Xu <davidxu@FreeBSD.org>

Implement aio_fsync() syscall.


# 27b8220d 25-Feb-2006 David Xu <davidxu@FreeBSD.org>

1. Remove aio entry from lists earlier in aio_free_entry,
so other threads can not see it if we unlock the proc
lock (this can happen in knlist_delete). Don't do wakeup,
it is not necessary.

2. Decrease kaio_buffer_count in biohelper rather than
doing it in aio_bio_done_notify.

3. In aio_bio_done_notify, don't send notification if KAIO_RUNDOWN
was set, because the process is already in single thread mode.

4. Use assignment to initialize aiothreadflags.

5. AIOCBLIST_RUNDOWN is not useful, axe the code using it.

6. use LIO_NOP instead of zero.


# ad8de0f2 21-Feb-2006 David Xu <davidxu@FreeBSD.org>

If block size is zero, use normal file operations to do I/O,
this eliminates a divided-by-zero fault.

Recommended by: phk


# 6d53aa62 27-Jan-2006 David Xu <davidxu@FreeBSD.org>

Just like dofilewrite(), call bwillwrite before fo_write.


# 03d66b36 26-Jan-2006 David Xu <davidxu@FreeBSD.org>

return final error code in aio_return rather than a hardcoded 0.


# 55a122bf 26-Jan-2006 David Xu <davidxu@FreeBSD.org>

in aio_aqueue, store same return code into job->_aiocb_private.error.
in aio_return, unlock proc lock before suword.


# 1aa4c324 24-Jan-2006 David Xu <davidxu@FreeBSD.org>

Add locking annotation and comments about socket, pipe, fifo problem.
Temporarily fix a locking problem for socket I/O.


# e6bdc05f 23-Jan-2006 David Xu <davidxu@FreeBSD.org>

Er, rescure a deleted comment line.


# bd793be3 23-Jan-2006 David Xu <davidxu@FreeBSD.org>

More cleanup for aio code:
1) unregsiter kqueue filter for EVFILT_LIO.
2) free uma_zones.
3) call setsid directly to enter another session rather than
implementing by itself.

Submitted by: jhb


# 7f34b521 23-Jan-2006 David Xu <davidxu@FreeBSD.org>

Add bracket.


# 68d71118 23-Jan-2006 David Xu <davidxu@FreeBSD.org>

Verify all supported notification types.


# a9bf5e37 22-Jan-2006 David Xu <davidxu@FreeBSD.org>

1) Merge _aio_aqueue and aio_aqueue, check quota in aio_aqueue,
so that lio_listio won't exceed the quota.
2) Remove lio_ref_count, it is no longer used.


# 8c0d9af5 22-Jan-2006 David Xu <davidxu@FreeBSD.org>

Fix a bogus panic.


# 9b84335c 22-Jan-2006 David Xu <davidxu@FreeBSD.org>

Decrease kaio_active_count first, because user process may go away
after we notified it.


# 1ce91824 21-Jan-2006 David Xu <davidxu@FreeBSD.org>

Make aio code MP safe.


# 8213baf0 14-Jan-2006 Christian S.J. Peron <csjp@FreeBSD.org>

Initialize ki to p->p_aioinfo after we know it's going to be referencing
a valid kaioinfo structure. This avoids a potential NULL pointer dereference.

Found with: Coverity Prevent(tm)
MFC after: 2 weeks


# af56abaa 06-Jan-2006 John Baldwin <jhb@FreeBSD.org>

Return error from fget_write() rather than hardcoding EBADF now that
fget_write() DTRT.

Requested by: bde


# 323fe565 08-Nov-2005 David Xu <davidxu@FreeBSD.org>

In aio_waitcomplete, do not return EAGAIN if no other threads
have started aio, instead, initialize aio management structure
if it hasn't been done, the reason to adjust this behavior is
to make it a bit friendly for threaded program, consider two
threads, one submits aio_write, and another just calls
aio_waitcomplete to wait any I/O to be completed and recycle the
aio requests, before submitter doing any I/O, the recycler wants
to wait in kernel. This also fixes inconsistency with other aio
syscalls.


# 2a522eb9 08-Nov-2005 John Baldwin <jhb@FreeBSD.org>

Various and sundry cleanups:
- Use curthread for calls to knlist_delete() and add a big comment
explaining why as well as appropriate assertions.
- Use TAILQ_FOREACH and TAILQ_FOREACH_SAFE instead of handrolling them.
- Use fget() family of functions to lookup file objects instead of
grovelling around in file descriptor tables.
- Destroy the aio_freeproc mutex if we are unloaded.

Tested on: i386


# 8f0371f1 04-Nov-2005 David Xu <davidxu@FreeBSD.org>

Fix name compatible problem with POSIX standard. the sigval_ptr and
sigval_int really should be sival_ptr and sival_int.
Also sigev_notify_function accepts a union sigval value but not a
pointer.


# 4c0fb2cf 02-Nov-2005 David Xu <davidxu@FreeBSD.org>

Support sending realtime signal information via signal queue, realtime
signal memory is pre-allocated, so kernel can always notify user code.


# 68a17869 01-Nov-2005 John Baldwin <jhb@FreeBSD.org>

Push down Giant into fdfree() and remove it from two of the callers.
Other callers such as some rfork() cases weren't locking Giant anyway.

Reviewed by: csjp
MFC after: 1 week


# 0972628a 29-Oct-2005 David Xu <davidxu@FreeBSD.org>

Fix sigevent's POSIX incompatible problem by adding member fields
sigev_notify_function and sigev_notify_attributes. AIO syscalls
use sigevent, so they have to be adjusted.

Reviewed by: alc


# db43cd04 12-Oct-2005 Doug Ambrisko <ambrisko@FreeBSD.org>

Fix tinderbox box by removing incomplete/bad spl usage. Proper giant free
locking is required in for aio.

Pointed out by: imp


# 69cd28da 12-Oct-2005 Doug Ambrisko <ambrisko@FreeBSD.org>

Add in kqueue support to LIO event notification and fix how it handled
notifications when LIO operations completed. These were the problems
with LIO event complete notification:
- Move all LIO/AIO event notification into one general function
so we don't have bugs in different data paths. This unification
got rid of several notification bugs one of which if kqueue was
used a SIGILL could get sent to the process.
- Change the LIO event accounting to count all AIO request that
could have been split across the fast path and daemon mode.
The prior accounting only kept track of AIO op's in that
mode and not the entire list of operations. This could cause
a bogus LIO event complete notification to occur when all of
the fast path AIO op's completed and not the AIO op's that
ended up queued for the daemon.

Suggestions from: alc


# ec9c9e73 20-Jul-2005 Alan Cox <alc@FreeBSD.org>

Eliminate inconsistency in the setting of the B_DONE flag. Specifically,
make the b_iodone callback responsible for setting it if it is needed.
Previously, it was set unconditionally by bufdone() without holding
whichever lock is shared by the b_iodone callback and the corresponding
top-half function. Consequently, in a race, the top-half function could
conclude that operation was done before the b_iodone callback finished.
See, for example, aio_physwakeup() and aio_fphysio().

Note: I don't believe that the other, more widely-used b_iodone callbacks
are affected.

Discussed with: jeff
Reviewed by: phk
MFC after: 2 weeks


# 571dcd15 01-Jul-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Fix the recent panics/LORs/hangs created by my kqueue commit by:

- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone


# b490cc72 06-Jun-2005 Alan Cox <alc@FreeBSD.org>

In lio_listio(2) change jobref from an int to a long so that
lio_listio(LIO_WAIT, ...) works correctly on 64-bit architectures.

Reviewed by: tegge


# 67b95a95 04-Jun-2005 Alan Cox <alc@FreeBSD.org>

Eliminate an unused field from struct aio_liojob.


# bbe7bbdf 04-Jun-2005 Alan Cox <alc@FreeBSD.org>

Eliminate the original method of requesting notification of aio_read(2) and
aio_write(2) completion through kevent(2). This method does not work on
64-bit architectures. It was deprecated in FreeBSD 4.4. See revisions
1.87 and 1.70.2.7.

Change aio_physwakeup() to call psignal(9) directly rather than indirectly
through a timeout(9). Discussed with: bde

Correct a bug introduced in revision 1.65 that could result in premature
delivery of a signal if an lio_listio(2) consisted of a mixture of
direct/raw and queued I/O operations. Observed by: tegge

Eliminate a field from struct kaioinfo that is now unused.

Reviewed by: tegge


# 3769f562 02-Jun-2005 Alan Cox <alc@FreeBSD.org>

Synchronize access to the per process aiocb lists in many of the functions.


# e293dc86 02-Jun-2005 Alan Cox <alc@FreeBSD.org>

In aio_waitcomplete() correct two cases of using an aiocb after freeing it.


# 3148c2c9 30-May-2005 Alan Cox <alc@FreeBSD.org>

Synchronize access to aio_freeproc with a mutex. Eliminate related spl
calls.

Reduce the scope of Giant in aio_daemon().


# 3999ebe3 30-May-2005 Alan Cox <alc@FreeBSD.org>

Use the proc mtx to prevent simultaneous changes to p_aioinfo.


# 82851350 30-May-2005 Alan Cox <alc@FreeBSD.org>

Eliminate unnecessary calls to wakeup(); no one sleeps on &aio_freeproc.

Eliminate an unused flag, AIOP_SCHED; it's cleared but never set.


# 95eca142 29-May-2005 Alan Cox <alc@FreeBSD.org>

Eliminate aio_activeproc; it's unused.


# 8484b5e6 29-May-2005 Alan Cox <alc@FreeBSD.org>

Eliminate aio_bufjobs; it's unused.


# a230c79b 30-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Acquire Giant in AIO's iodone routine. VFS will no longer do it for us
soon.

Sponsored by: Isilon Systems, Inc.


# c4c44d29 17-Mar-2005 John-Mark Gurney <jmg@FreeBSD.org>

fix aio+kq... I've been running ambrisko's test program for much longer
w/o problems than I was before... This simply brings back the knote_delete
as knlist_delete which will also drop the knote's, instead of just clearing
the list and seeing _ONESHOT...

Fix a race where if a note was _INFLUX and _DETACHED, it could end up being
modified... whoopse..

MFC after: 1 week
Prodded by: ambrisko and dwhite


# 5ece08f5 09-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Make a SYSCTL_NODE static


# 9454b2d8 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for copyright notices, minor format tweaks as necessary


# c5690651 04-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove buf->b_dev field.


# 6afb3b1c 29-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Give dev_strategy() an explict cdev argument in preparation for removing
buf->b-dev.

Put a bio between the buf passed to dev_strategy() and the device driver
strategy routine in order to not clobber fields in the buf.

Assert copyright on vfs_bio.c and update copyright message to canonical
text. There is no legal difference between John Dysons two-clause
abbreviated BSD license and the canonical text.


# 5d9d81e7 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Put the I/O block size in bufobj->bo_bsize.

We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth
when filesystems sit on GEOM, so don't bother for now.


# 576c004f 30-Sep-2004 Alfred Perlstein <alfred@FreeBSD.org>

cover soreadable and sowriteable with the corresponding socketbuffer locks.


# 1a52a73d 23-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate DEV_STRATEGY() macro: call dev_strategy() directly.

Make dev_strategy() handle errors and departing devices properly.


# b6ac5828 02-Sep-2004 Robert Watson <rwatson@FreeBSD.org>

Tag AIO as requiring Giant over the network stack using
NET_NEEDS_GIANT().

RELENG_5 candidate.


# ad3b9257 15-Aug-2004 John-Mark Gurney <jmg@FreeBSD.org>

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# ac77164d 13-Aug-2004 John-Mark Gurney <jmg@FreeBSD.org>

clean up whitespace...


# 1a276a3f 26-Jul-2004 Alan Cox <alc@FreeBSD.org>

- Use atomic ops for updating the vmspace's refcnt and exitingcnt.
- Push down Giant into shmexit(). (Giant is acquired only if the vmspace
contains shm segments.)
- Eliminate the acquisition of Giant from proc_rwmem().
- Reduce the scope of Giant in exit1(), uncovering the destruction of the
address space.


# 9535efc0 17-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Merge additional socket buffer locking from rwatson_netperf:

- Lock down low hanging fruit use of sb_flags with socket buffer
lock.

- Lock down low hanging fruit use of so_state with socket lock.

- Lock down low hanging fruit use of so_options.

- Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with
socket buffer lock.

- Annotate situations in which we unlock the socket lock and then
grab the receive socket buffer lock, which are currently actually
the same lock. Depending on how we want to play our cards, we
may want to coallesce these lock uses to reduce overhead.

- Convert a if()->panic() into a KASSERT relating to so_state in
soaccept().

- Remove a number of splnet()/splx() references.

More complex merging of socket and socket buffer locking to
follow.


# 77409fe1 30-May-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add missing #include <sys/module.h>


# a5bdcb2a 13-Mar-2004 Peter Wemm <peter@FreeBSD.org>

Make the process_exit eventhandler run without Giant. Add Giant hooks
in the two consumers that need it.. processes using AIO and netncp.
Update docs. Say that process_exec is called with Giant, but not to
depend on it. All our consumers can handle it without Giant.


# 00cbe31b 15-Nov-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Send B_PHYS out to pasture, it no longer serves any function.


# 0eb3b7bb 24-Oct-2003 John-Mark Gurney <jmg@FreeBSD.org>

don't allow reading from files that haven't been open'd for reading.


# a44ca4f0 21-Oct-2003 Hidetoshi Shimokawa <simokawa@FreeBSD.org>

We need to initialize bp->b_offset and bp->b_iooffset
becuase bp->b_blkno is ignored now.


# 8edbaf85 10-Sep-2003 Hidetoshi Shimokawa <simokawa@FreeBSD.org>

Fix asynchronous physio breakage introduced in rev 1.163.
We cannnot use bp->b_caller2 because DEV_STRATEGY will overwrite it.


# 3b6d9652 22-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Add a f_vnode field to struct file.

Several of the subtypes have an associated vnode which is used for
stuff like the f*() functions.

By giving the vnode a speparate field, a number of checks for the specific
subtype can be replaced simply with a check for f_vnode != NULL, and
we can later free f_data up to subtype specific use.

At this point in time, f_data still points to the vnode, so any code I
might have overlooked will still work.


# e725c18c 16-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Get rid of the b_spc specialty field in struct buf by using an already
available caller private field.


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 104a9b7e 29-Apr-2003 Alexander Kabaev <kan@FreeBSD.org>

Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on: standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>


# cd4ed3b5 17-Apr-2003 John Baldwin <jhb@FreeBSD.org>

- kthread's don't have p_textvp set to anything, so replace code that
dealt with that possibility with a KASSERT().
- No need to set P_SYSTEM, kthread_create() does that for us.


# ef38cda1 05-Apr-2003 Alan Cox <alc@FreeBSD.org>

Don't reinitialize fields that are already initialized by getpbuf().


# 06363906 03-Apr-2003 Alan Cox <alc@FreeBSD.org>

o Remove useracc() calls from aio_qphysio(); they are redundant
given the checks performed by vmapbuf().

Reviewed by: tegge


# 75b8b3b2 24-Mar-2003 John Baldwin <jhb@FreeBSD.org>

Replace the at_fork, at_exec, and at_exit functions with the slightly more
flexible process_fork, process_exec, and process_exit eventhandlers. This
reduces code duplication and also means that I don't have to go duplicate
the eventhandler locking three more times for each of at_fork, at_exec, and
at_exit.

Reviewed by: phk, jake, almost complete silence on arch@


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 2d5c7e45 20-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Close the remaining user address mapping races for physical
I/O, CAM, and AIO. Still TODO: streamline useracc() checks.

Reviewed by: alc, tegge
MFC after: 7 days


# ac41f2ef 13-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

style(9) fixes, mostly add parens around return arguments.


# 48e3128b 12-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.


# ae3b195f 12-Jan-2003 Tim J. Robbins <tjr@FreeBSD.org>

Allowing nent < 0 in aio_suspend() and lio_listio() is just asking for
trouble. Return EINVAL instead.


# 44a2c818 12-Jan-2003 Tim J. Robbins <tjr@FreeBSD.org>

Remove "XXX undocumented" comment from lio_listio().


# cd72f218 11-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.


# e2a3ea1c 02-Jan-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused second argument from DEV_STRATEGY().


# 5590e7fd 27-Dec-2002 Alfred Perlstein <alfred@FreeBSD.org>

Lock filedesc while performing a range check on the file descriptor.

Reviewed by: alc


# f51c1e89 16-Nov-2002 Alfred Perlstein <alfred@FreeBSD.org>

Rework the sysconf(3) interaction with aio:

sysconf.c:
Use 'break' rather than 'goto yesno' in sysconf.c so that we report a '0'
return value from the kernel sysctl.

vfs_aio.c:
Make aio reset its configuration parameters to -1 after unloading
instead of 0.

posix4_mib.c:
Initialize the aio configuration parameters to -1
to indicate that it is not loaded.
Add a facility (p31b_iscfg()) to determine if a posix4 facility has been
initialized to avoid having to re-order the SYSINITs.
Use p31b_iscfg() to determine if aio has had a chance to run yet which
is likely if it is compiled into the kernel and avoid spamming its
values.
Introduce a macro P31B_VALID() instead of doing the same comparison over
and over.

posix4.h:
Prototype p31b_iscfg().


# 86d52125 15-Nov-2002 Alfred Perlstein <alfred@FreeBSD.org>

Export the values for _SC_AIO_MAX and _SC_AIO_PRIO_DELTA_MAX via the p1003b
sysctl interface.


# c844abc9 15-Nov-2002 Alfred Perlstein <alfred@FreeBSD.org>

Call 'p31b_setcfg(CTL_P1003_1B_AIO_LISTIO_MAX, AIO_LISTIO_MAX)'
when AIO is initialized so that sysconf() gives correct results.

Reported by: Craig Rodrigues <rodrigc@attbi.com>


# f8f750c5 07-Nov-2002 Robert Watson <rwatson@FreeBSD.org>

Do a bit more work in the aio code to simulate the credential environment
of the original AIO request: save and restore the active thread credential
as well as using the file credential, since MAC (and some other bits of
the system) rely on the thread credential instead of/as well as the
file credential. In brief: cache td->td_ucred when the AIO operation
is queued, temporarily set and restore the kernel thread credential,
and release the credential when done. Similar to ktrace credential
management.

Reviewed by: alc
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# c7047e52 27-Oct-2002 Garrett Wollman <wollman@FreeBSD.org>

Change the way support for asynchronous I/O is indicated to applications
to conform to 1003.1-2001. Make it possible for applications to actually
tell whether or not asynchronous I/O is supported.

Since FreeBSD's aio implementation works on all descriptor types, don't
call down into file or vnode ops when [f]pathconf() is asked about
_PC_ASYNC_IO; this avoids the need for every file and vnode op to know about
it.


# 6d345e2a 18-Oct-2002 John Baldwin <jhb@FreeBSD.org>

fdfree() clears p_fd for us, no need to do it again.


# 4d752b01 13-Oct-2002 Alan Cox <alc@FreeBSD.org>

Eliminate the unnecessary clearing of flag bits that are already clear
in lio_listio(2).


# 316ec49a 02-Oct-2002 Scott Long <scottl@FreeBSD.org>

Some kernel threads try to do significant work, and the default KSTACK_PAGES
doesn't give them enough stack to do much before blowing away the pcb.
This adds MI and MD code to allow the allocation of an alternate kstack
who's size can be speficied when calling kthread_create. Passing the
value 0 prevents the alternate kstack from being created. Note that the
ia64 MD code is missing for now, and PowerPC was only partially written
due to the pmap.c being incomplete there.
Though this patch does not modify anything to make use of the alternate
kstack, acpi and usb are good candidates.

Reviewed by: jake, peter, jhb


# 4a6a94d8 22-Aug-2002 Archie Cobbs <archie@FreeBSD.org>

Replace (ab)uses of "NULL" where "0" is really meant.


# 0a179f80 22-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Remove the AIOCBLIST_ASYNCFREE flag and related code. It's never set.

Submitted by: Romer Gil <rgil@cs.rice.edu>


# ad49abc0 11-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Make a correction to the last change: In aio_cancel(2) return AIO_ALLDONE
instead of EINVAL if p->p_aioinfo is NULL.


# b6c1f1ef 10-Aug-2002 Alan Cox <alc@FreeBSD.org>

o In aio_cancel(2), make sure that p->p_aioinfo isn't NULL before
dereferencing it.

Submitted by: saureen <sshah@apple.com>


# b46f1c55 06-Aug-2002 Alan Cox <alc@FreeBSD.org>

Set the ident field of the struct kevent that is registered by _aio_aqueue()
to the address of the user's aiocb rather than the kernel's aiocb. (In other
words, prior to this change, the ident field returned by kevent(2) on
completion of an AIO was effectively garbage.)

Submitted by: Romer Gil <rgil@cs.rice.edu>


# 20fb589d 05-Aug-2002 Alan Cox <alc@FreeBSD.org>

o The introduction of kevent() broke lio_listio(): _aio_aqueue() thought
that LIO_READ and LIO_WRITE were requests for kevent()-based
notification of completion. Modify _aio_aqueue() to recognize LIO_READ
and LIO_WRITE.

Notes: (1) The patch provided by the PR perpetuates a second bug in this
code, a direct access to user-space memory. This change fixes that bug
as well. (2) This change is to code that implements a deprecated interface.
It should probably be removed after an MFC.

PR: kern/39556


# 4cc20ab1 31-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# a739e09c 25-May-2002 Alan Cox <alc@FreeBSD.org>

o Remove some unnecessary casting from and add some necessary casting to
aio_suspend() and lio_listio().

Submitted by: bde


# 34e3110c 24-May-2002 Peter Wemm <peter@FreeBSD.org>

Fix warnings. Also, removed an unused variable that I found that was just
initialized and never used afterwards.


# 243917fe 19-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# 6041fa0a 03-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

As malloc(9) and free(9) are now Giant-free, remove the Giant lock
across malloc(9) and free(9) of a pgrp or a session.


# 1c2451c2 19-Apr-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Push down Giant for setpgid(), setsid() and aio_daemon(). Giant protects only
malloc(9) and free(9).


# ba626c1d 16-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Lock proctree_lock instead of pgrpsess_lock.


# 00e73160 13-Apr-2002 Alan Cox <alc@FreeBSD.org>

o Use aiocblist::fd_file in the AIO threads rather than recomputing
the file * from the calling process's descriptor table.
o Eliminate sharing of the calling process's descriptor table
with the AIO threads.


# c0bf5caa 07-Apr-2002 Alan Cox <alc@FreeBSD.org>

Restructure aio_return() to eliminate duplicated code and facilitate Giant
push down.


# ae124fc4 07-Apr-2002 Alan Cox <alc@FreeBSD.org>

Reduce the duplication of code for error handling in _aio_aqueue().


# 63a4964e 06-Apr-2002 Alan Cox <alc@FreeBSD.org>

Change jobref and *ijoblist from int to long in order to avoid
a catastrophe after the 2^32nd AIO operation on 64-bit architectures.


# 9b16adc1 03-Apr-2002 Alan Cox <alc@FreeBSD.org>

o aio_process needn't fhold()/fdrop() the fp now that _aio_aqueue() and
aio_free_entry() do this.
o Remove two unnecessary/unused variables from aio_process() and one field
from aiocblist.


# a5c0b1c0 31-Mar-2002 Alan Cox <alc@FreeBSD.org>

Keep the reference to the file acquired in _aio_aqueue() until the operation
completes. The reference is released in aio_free_entry().

Submitted by: tegge


# ee99e978 25-Mar-2002 Bruce Evans <bde@FreeBSD.org>

Added used include of <sys/sx.h>. Don't depend on namespace pollution in
<sys/file.h> or <sys/socketvar.h>.


# c897b813 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# eb8e6d52 05-Mar-2002 Eivind Eklund <eivind@FreeBSD.org>

Document all functions, global and static variables, and sysctls.
Includes some minor whitespace changes, and re-ordering to be able to document
properly (e.g, grouping of variables and the SYSCTL macro calls for them, where
the documentation has been added.)

Reviewed by: phk (but all errors are mine)


# f591779b 23-Feb-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock struct pgrp, session and sigio.

New locks are:

- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.

Please refer to sys/proc.h for the coverage of these locks.

Changes on the pgrp/session interface:

- pgfind() needs the pgrpsess_lock held.

- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.

- Call enterthispgrp() in order to enter an existing pgrp.

- pgsignal() requires a pgrp lock held.

Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)


# 9fbd7ccf 12-Feb-2002 Alan Cox <alc@FreeBSD.org>

o Clearing p/td_retval[0] after aio_newproc() is unnecessary. (We stopped
calling rfork() to create aio threads in revision 1.46.)
o Don't recompute the FILE * when it's already stored in the kernel's AIOCB.


# 079b7bad 07-Feb-2002 Julian Elischer <julian@FreeBSD.org>

Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# c3869e4b 20-Jan-2002 Alan Cox <alc@FreeBSD.org>

o Remove the unused vestiges of JOBST_JOBQPROC and
the per-thread jobtorun queue.
o Use TAILQ_EMPTY() instead of TAILQ_FIRST(...) == NULL.


# 12f63f17 19-Jan-2002 Alan Cox <alc@FreeBSD.org>

o Revision 1.99 ("KSE Milestone 2") left the aio daemons
sleeping on a process object but changed the corresponding
wakeup()s to the thread object. The result was that non-raw
aio ops waited for an aio daemon to timeout before action
was taken. Now, we sleep on the thread object.

PR: kern/34016


# 825ce531 17-Jan-2002 Alan Cox <alc@FreeBSD.org>

o Eliminate an unused parameter from aio_fphysio().


# c6c191b2 14-Jan-2002 Alan Cox <alc@FreeBSD.org>

o Correct the initialization of aiolio_zone: Each entry was 16 times larger
than necessary.
o Move a rarely-used goto label inside a critical section so that we don't
perform an splnet() for which there is no corresponding splx().
o Remove unnecessary splnet()/splx() around accesses to kaioinfo::kaio_jobdone
in aio_return().
o Use TAILQ_FOREACH for simple cases of iteration over kaioinfo::kaio_jobdone.


# 7d17bbd0 08-Jan-2002 Alan Cox <alc@FreeBSD.org>

o Correct a 32/64-bit error in the initialization of aiol_zone, specifically,
sizeof(int) is not the size of a pointer.


# 48dac059 06-Jan-2002 Alan Cox <alc@FreeBSD.org>

o Add missing synchronization (splnet()/splx()) in aio_free_entry().
o Move the definition of struct aiocblist from sys/aio.h to kern/vfs_aio.c.
o Make aio_swake_cb() static.


# 23f13943 02-Jan-2002 Alan Cox <alc@FreeBSD.org>

o Properly check the file descriptor passed to aio_cancel(2). (Previously,
no out-of-bounds check was performed on the file descriptor.)
o Eliminate some excessive white space from aio_cancel(2).


# eae43d0e 31-Dec-2001 Alan Cox <alc@FreeBSD.org>

o Some style(9)-motivated changes to white space.


# 5ca50a4b 30-Dec-2001 Alan Cox <alc@FreeBSD.org>

o Correct an off-by-one error in aio_suspend(2).

PR: 18350


# 516d2564 30-Dec-2001 Alan Cox <alc@FreeBSD.org>

o Use "td->td_proc" instead of "curproc" where possible.
o Eliminate the unnecessary initialization of several static variables
to zero.


# 21d56e9c 29-Dec-2001 Alfred Perlstein <alfred@FreeBSD.org>

Make AIO a loadable module.

Remove the explicit call to aio_proc_rundown() from exit1(), instead AIO
will use at_exit(9).

Add functions at_exec(9), rm_at_exec(9) which function nearly the
same as at_exec(9) and rm_at_exec(9), these functions are called
on behalf of modules at the time of execve(2) after the image
activator has run.

Use a modified version of tegge's suggestion via at_exec(9) to close
an exploitable race in AIO.

Fix SYSCALL_MODULE_HELPER such that it's archetecuterally neutral,
the problem was that one had to pass it a paramater indicating the
number of arguments which were actually the number of "int". Fix
it by using an inline version of the AS macro against the syscall
arguments. (AS should be available globally but we'll get to that
later.)

Add a primative system for dynamically adding kqueue ops, it's really
not as sophisticated as it should be, but I'll discuss with jlemon when
he's around.


# 604035c5 09-Dec-2001 Alan Cox <alc@FreeBSD.org>

o Eliminate compilation warnings on 64-bit architectures.


# 91369fc7 09-Dec-2001 Alan Cox <alc@FreeBSD.org>

o Eliminate unnecessary synchronization from filt_aiodetach().
o The manual page for kevent says that EVFILT_AIO returns under the same
conditions as aio_error(). With that in mind, set the data field
of the returned struct kevent to the value that would be returned
by aio_error().
o Fix two compilation warnings.


# 43150722 05-Oct-2001 John Baldwin <jhb@FreeBSD.org>

The aio kthreads start off with a root credential just like all other
kthreads, so don't malloc a ucred just so we can create a duplicate of the
one we already have.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 2f3cf918 18-Apr-2001 Alfred Perlstein <alfred@FreeBSD.org>

Check validity of signal callback requested via aio routines.

Also move the insertion of the request to after the request is validated,
there's still looks like there may be some problems if an invalid address
is passed to the aio routines, basically a possible leak or having a
not completely initialized structure on the queue may still be possible.

A new sig macro was made _SIG_VALID to check the validity of a signal,
it would be advisable to use it from now on (in kern/kern_sig.c) rather
than rolling your own.

PR: kern/17152


# 13644654 10-Mar-2001 Alan Cox <alc@FreeBSD.org>

When aio_read/write() is used on a raw device, physical buffers are
used for up to "vfs.aio.max_buf_aio" of the requests. If a request
size is MAXPHYS, but the request base isn't page aligned, vmapbuf()
will map the end of the user space buffer into the start of the kva
allocated for the next physical buffer. Don't use a physical buffer
in this case. (This change addresses problem report 25617.)

When an aio_read/write() on a raw device has completed, timeout() is
used to schedule a signal to the process. Thus, the reporting is
delayed up to 10 ms (assuming hz is 100). The process might have
terminated in the meantime, causing a trap 12 when attempting to
deliver the signal. Thus, the timeout must be cancelled when removing
the job.

aio jobs in state JOBST_JOBQGLOBAL should be removed from the
kaio_jobqueue list during process rundown.

During process rundown, some aio jobs might move from one list to a
different list that has already been "emptied", causing the rundown to
be incomplete. Retry the rundown.

A call to BUF_KERNPROC() is needed after obtaining a physical buffer
to disassociate the lock from the running process since it can return
to userland without releasing that lock.

PR: 25617
Submitted by: tegge


# c9a970a7 08-Mar-2001 Alan Cox <alc@FreeBSD.org>

Use the kthread API to create and destroy AIO daemons.

Submitted by: jhb


# 19eb87d2 06-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Grab the process lock while calling psignal and before calling psignal.


# 9c8a2647 06-Mar-2001 Alan Cox <alc@FreeBSD.org>

Add a missing splx() to aio_fphysio(). (This change is a no-op in -5.0,
but potentially significant in -4.x.)

Eliminate a pointless parameter to aio_fphysio().

Remove unnecessary casts from aio_fphysio() and aio_physwakeup().


# 88ed460e 04-Mar-2001 Alan Cox <alc@FreeBSD.org>

Eliminate the aio_freejobs list. Its purpose was to store free
aiocb's allocated by zalloc(). In other words, zfree() was never
called. Now, we call zfree(). Why eliminate this micro-
optimization? At some later point, when we multithread the AIO
system, we would need a mutex to synchronize access to aio_freejobs,
making its use nearly indistinguishable in cost from zalloc() and
zfree().

Remove unnecessary fhold() and fdrop() calls from aio_qphysio(),
undo'ing a part of revision 1.86. The reference count on the file
structure is already incremented by _aio_aqueue() before it calls
aio_qphysio(). (Update the comments to document this fact.)

Remove unnecessary casts from _aio_aqueue(), aio_read(), aio_write()
and aio_waitcomplete().

Remove an unnecessary "return;" from aio_process().

Add "static" in various places.


# fb579e9a 03-Mar-2001 Alan Cox <alc@FreeBSD.org>

Remove the field privatemodes from struct __aiocb_private and the
related code from aio_read() and aio_write(). This field was
intended, but never used, to allow a mythical user-level library to
make an aio_read() or aio_write() behave like an ordinary read() or
write(), i.e., a blocking I/O operation.


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# f09deb69 06-Feb-2001 Jeroen Ruigrok van der Werven <asmodai@FreeBSD.org>

Fix typo: wierd -> weird.

There is no such thing as wierd in the english language.


# 37d40066 04-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Another round of the <sys/queue.h> FOREACH transmogriffer.

Created with: sed(1)
Reviewed by: md5(1)


# 86360fee 01-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

Remove thr_sleep and thr_wakeup. Remove fields p_nthread and p_wakeup
from struct proc, which are now unused (p_nthread already was).
Remove process flag P_KTHREADP which was untested and only set
in vfs_aio.c (it should use kthread_create). Move the yield
system call to kern_synch.c as kern_threads.c has been removed
completely.

moral support from: alfred, jhb


# c6fa9f78 21-Nov-2000 Alan Cox <alc@FreeBSD.org>

Provide a new interface for the user of aio_read() and aio_write() to request
a kevent upon completion of the I/O. Specifically, introduce a new type
of sigevent notification, SIGEV_EVENT. If sigev_notify is SIGEV_EVENT,
then sigev_notify_kqueue names the kqueue that should receive the event
and sigev_value contains the "void *" is copied into the kevent's udata
field.

In contrast to the existing interface, this one: 1) works on
the Alpha 2) avoids the extra copyin() call for the kevent because all
of the information needed is in the sigevent and 3) could be
applied to request a single kevent upon completion of an entire lio_listio().

Reviewed by: jlemon


# 279d7226 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>


# 39b2b25f 29-Oct-2000 Alan Cox <alc@FreeBSD.org>

_aio_aqueue(): Change kevent registration to use its own struct file pointer.
Otherwise, aio_read() and aio_write() on sockets are broken if a kevent is
registered. (The code after kevent registration for handling sockets assumes
that the struct file pointer "fp" still refers to the socket, not the kqueue.)


# 35e0e5b3 20-Oct-2000 John Baldwin <jhb@FreeBSD.org>

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# b92bb032 26-Sep-2000 Alan Cox <alc@FreeBSD.org>

aio_qphysio: Eliminate one instance of an out-of-range check that is
performed twice. Eliminate initialization that is already performed
by _aio_aqueue.

aio_physwakeup: Eliminate redundant synchronization that is already
performed by bufdone.


# 621dbe43 16-Sep-2000 Bruce Evans <bde@FreeBSD.org>

Added used include of <sys/mutex.h> (don't depend on pollution in
<sys/signalvar.h>).


# a93a7807 10-Sep-2000 John Baldwin <jhb@FreeBSD.org>

aio processes need to have the Giant mutex before doing work.

Submitted by: tegge


# f535380c 05-Sep-2000 Don Lewis <truckman@FreeBSD.org>

Remove uidinfo hash table lookup and maintenance out of chgproccnt() and
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize(). Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent. Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access. Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid. Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c. Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().


# b70158ba 04-Sep-2000 Alan Cox <alc@FreeBSD.org>

Make filt_aio() check the jobstate for JOBST_JOBBFINISHED (in addition
to JOBST_JOBFINISHED) in case the aio_read() or aio_write() was performed
via the high-performance physio method, i.e., aio_qphysio().


# 5dec52ba 28-Jul-2000 Peter Wemm <peter@FreeBSD.org>

Fix the #ifdef VFS_AIO to not compile a whole bunch of unused stuff in the
!VFS_AIO case. Lots of things have hooks into here (kqueue, exit(),
sockets, etc), I elected to keep the external interfaces the same
rather than spread more #ifdefs around the kernel.


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 9626b608 05-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# cb679c38 16-Apr-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce kqueue() and kevent(), a kernel event notification facility.


# c244d2de 02-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move B_ERROR flag to b_ioflags and call it BIO_ERROR.

(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.


# b99c307a 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


# 21144e3b 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch were machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.


# dd85920a 23-Feb-2000 Jason Evans <jasone@FreeBSD.org>

Add the VFS_AIO config option and leave it off by default. Unless the
VFS_AIO option is specified, all aio-related syscalls return ENOSYS.

The aio code is very fragile right now, and is unsuitable for default
inclusion in a production shell box.

Approved by: jkh


# b7592c7b 20-Jan-2000 Jason Evans <jasone@FreeBSD.org>

Back out the previous spl change, since it opens a race window.

Reviewed by: alfred, dillon, peter


# 60ffb019 19-Jan-2000 Jason Evans <jasone@FreeBSD.org>

Don't tsleep() while at splbio().

Correctly return EINPROGRESS from aio_error() even when an aio request
is still in the socket queue.

Submitted by: Adrian Chadd <adrian@bofh.co.uk>


# f582ac06 17-Jan-2000 Brian Feldman <green@FreeBSD.org>

Fix vn_isdisk() usage to make AIO work on non-disk-files again, rather
than just return ENOTBLK.

PR: 16163
Submitted by: Adrian Chadd <adrian@FreeBSD.org>


# bfbbc4aa 13-Jan-2000 Jason Evans <jasone@FreeBSD.org>

Add aio_waitcomplete(). Make aio work correctly for socket descriptors.
Make gratuitous style(9) fixes (me, not the submitter) to make the aio
code more readable.

PR: kern/12053
Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>


# ba4ad1fc 09-Jan-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


# 38224dcd 22-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by: sos


# 008626c3 07-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Simplify and de-bogotify check for raw disk.


# 02c58685 30-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Change useracc() and kernacc() to use VM_PROT_{READ|WRITE|EXECUTE} for the
"rw" argument, rather than hijacking B_{READ|WRITE}.

Fix two bugs (physio & cam) resulting by the confusion caused by this.

Submitted by: Tor.Egge@fast.no
Reviewed by: alc, ken (partly)


# d1f088da 11-Oct-1999 Peter Wemm <peter@FreeBSD.org>

Trim unused options (or #ifdef for undoc options).

Submitted by: phk


# 13ccadd4 19-Sep-1999 Brian Feldman <green@FreeBSD.org>

This is what was "fdfix2.patch," a fix for fd sharing. It's pretty
far-reaching in fd-land, so you'll want to consult the code for
changes. The biggest change is that now, you don't use
fp->f_ops->fo_foo(fp, bar)
but instead
fo_foo(fp, bar),
which increments and decrements the fp refcount upon entry and exit.
Two new calls, fhold() and fdrop(), are provided. Each does what it
seems like it should, and if fdrop() brings the refcount to zero, the
fd is freed as well.

Thanks to peter ("to hell with it, it looks ok to me.") for his review.
Thanks to msmith for keeping me from putting locks everywhere :)

Reviewed by: peter


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 49ff4deb 14-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Spring cleaning around strategy and disklabels/slices:

Introduce BUF_STRATEGY(struct buf *, int flag) macro, and use it throughout.
please see comment in sys/conf.h about the flag argument.

Remove strategy argument from all the diskslice/label/bad144
implementations, it should be found from the dev_t.

Remove bogus and unused strategy1 routines.

Remove open/close arguments from dssize(). Pick them up from dev_t.

Remove unused and unfinished setgeom support from diskslice/label/bad144 code.


# 4d4f9323 13-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

s/v_specinfo/v_rdev/


# 0ef1c826 08-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


# 9c8b8baa 01-Jul-1999 Peter Wemm <peter@FreeBSD.org>

Slight reorganization of kernel thread/process creation. Instead of using
SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface. kproc_start is still available
as a SYSINIT() hook. This allowed simplification of chunks of the
sysinit code in the process. This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.

One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process. It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit. This means that nfsiod
doesn't need to be in /sbin and is always "available". This is a fair bit
easier to do outside of the SYSINIT_KT() framework.


# df8abd0b 30-Jun-1999 Peter Wemm <peter@FreeBSD.org>

Slight tweak to fork1() calling conventions. Add a third argument so
the caller can easily find the child proc struct. fork(), rfork() etc
syscalls set p->p_retval[] themselves. Simplify the SYSINIT_KT() code
and other kernel thread creators to not need to use pfind() to find the
child based on the pid. While here, partly tidy up some of the fork1()
code for RF_SIGSHARE etc.


# 67812eac 25-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


# 6fcd8a7c 01-Jun-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce the makebdev() function, it does the same as the makedev()
function for now, but that will change.


# 0a346dab 09-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

major(something) can never become NODEV.


# 4be2eb8c 08-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

I got tired of seeing all the cdevsw[major(foo)] all over the place.

Made a new (inline) function devsw(dev_t dev) and substituted it.

Changed to the BDEV variant to this format as well: bdevsw(dev_t dev)

DEVFS will eventually benefit from this change too.


# b0eeea20 06-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

remove b_proc from struct buf, it's (now) unused.

Reviewed by: dillon, bde


# d5558c00 06-May-1999 Peter Wemm <peter@FreeBSD.org>

Fix up a few easy 'assignment used as truth value' and 'suggest parens
around && within ||' type warnings. I'm pretty sure I have not masked
any problems here, I've committed real problem fixes seperately.


# 5206bca1 27-Apr-1999 Luoqi Chen <luoqi@FreeBSD.org>

Enable vmspace sharing on SMP. Major changes are,
- %fs register is added to trapframe and saved/restored upon kernel entry/exit.
- Per-cpu pages are no longer mapped at the same virtual address.
- Each cpu now has a separate gdt selector table. A new segment selector
is added to point to per-cpu pages, per-cpu global variables are now
accessed through this new selector (%fs). The selectors in gdt table are
rearranged for cache line optimization.
- fask_vfork is now on as default for both UP and SMP.
- Some aio code cleanup.

Reviewed by: Alan Cox <alc@cs.rice.edu>
John Dyson <dyson@iquest.net>
Julian Elischer <julian@whistel.com>
Bruce Evans <bde@zeta.org.au>
David Greenman <dg@root.com>


# 8fe387ab 04-Apr-1999 Dmitrij Tejblum <dt@FreeBSD.org>

Add standard padding argument to pread and pwrite syscall. That should make them
NetBSD compatible.

Add parameter to fo_read and fo_write. (The only flag FOF_OFFSET mean that
the offset is set in the struct uio).

Factor out some common code from read/pread/write/pwrite syscalls.


# a5c9bce7 25-Feb-1999 Bruce Evans <bde@FreeBSD.org>

Added a used #include (don't depend on "vnode_if.h" including <sys/buf.h>).


# b1028ad1 19-Feb-1999 Luoqi Chen <luoqi@FreeBSD.org>

Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This
is the preparation step for moving pmap storage out of vmspace proper.

Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillion <dillon@apollo.backplane.com>


# bc814931 29-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

More const fixes for -Wall, -Wcast-qual


# 9e26dd2a 29-Jan-1999 Bruce Evans <bde@FreeBSD.org>

Removed bogus casts to c_caddr_t. This is part of terminating
c_caddr_t with extreme prejudice. Here the original casts to
caddr_t were to support K&R compilers (or missing prototypes),
but the relevant source files require an ANSI compiler.


# 697457a1 28-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings related to -Wall -Wcast-qual


# 8aef1712 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# d254af07 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 1c7c3c6a 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# e3b3ba2d 15-Dec-1998 Dag-Erling Smørgrav <des@FreeBSD.org>

Wrap two macros into do { ... } while (0), and fix the way they're used
in the kernel.

Reviewed by: bde


# 18830dba 26-Nov-1998 Tor Egge <tegge@FreeBSD.org>

Don't forget to update the pmap associated with aio daemons when adding
new page directory entries for a growing kernel virtual address space.


# f5ef029e 25-Oct-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


# 2d2f8ae7 17-Aug-1998 Bruce Evans <bde@FreeBSD.org>

Fixed nonsense overflow checking (checking that a long variable is less
than INT_MAX after it has possibly overflowed).

Removed an unused variable and its associated 2 style bugs.

Removed unused includes.


# 30166fab 15-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Cast between longs and pointers via intptr_t. There shouldn't be
nearly so many casts here. Casting an pointer that was an integer
back to an integer just to compare it with -1 is bad, and casting
it back just to compare it with NULL is just wrong.


# 596f8506 05-Jul-1998 Julian Elischer <julian@FreeBSD.org>

fix braino from yesterdays' megacommit

Not sure of the result of it..
(may or may not effect anything) but it's fixed now.
(found by: comparing what cvsup sent back to me with what I tested..)


# f7ea2f55 04-Jul-1998 Julian Elischer <julian@FreeBSD.org>

There is no such thing any more as "struct bdevsw".

There is only cdevsw (which should be renamed in a later edit to deventry
or something). cdevsw contains the union of what were in both bdevsw an
cdevsw entries. The bdevsw[] table stiff exists and is a second pointer
to the cdevsw entry of the device. it's major is in d_bmaj rather than
d_maj. some cleanup still to happen (e.g. dsopen now gets two pointers
to the same cdevsw struct instead of one to a bdevsw and one to a cdevsw).

rawread()/rawwrite() went away as part of this though it's not strictly
the same patch, just that it involves all the same lines in the drivers.

cdroms no longer have write() entries (they did have rawwrite (?)).
tapes no longer have support for bdev operations.

Reviewed by: Eivind Eklund and Mike Smith
Changes suggested by eivind.


# 8c12612c 10-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

64bit fixes: don't cast pointers to int.


# dc733423 17-Apr-1998 Dag-Erling Smørgrav <des@FreeBSD.org>

Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.


# 227ee8a1 30-Mar-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


# 8a6472b7 28-Mar-1998 Peter Dufault <dufault@FreeBSD.org>

Finish _POSIX_PRIORITY_SCHEDULING. Needs P1003_1B and
_KPOSIX_PRIORITY_SCHEDULING options to work. Changes:

Change all "posix4" to "p1003_1b". Misnamed files are left
as "posix4" until I'm told if I can simply delete them and add
new ones;

Add _POSIX_PRIORITY_SCHEDULING system calls for FreeBSD and Linux;

Add man pages for _POSIX_PRIORITY_SCHEDULING system calls;

Add options to LINT;

Minor fixes to P1003_1B code during testing.


# 08637435 28-Mar-1998 Bruce Evans <bde@FreeBSD.org>

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


# 57518a4e 24-Feb-1998 Bruce Evans <bde@FreeBSD.org>

Removed a stale comment and staler code.


# 303b270b 08-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Staticize.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# 64889941 09-Dec-1997 John Dyson <dyson@FreeBSD.org>

Quiet some lint.


# 78922e41 07-Dec-1997 John Dyson <dyson@FreeBSD.org>

Correct prototypes to match POSIX. Correct return code for aio_cancel.
Submitted by: Alex Nash <nash@mcs.com>


# e499ed6f 01-Dec-1997 John Dyson <dyson@FreeBSD.org>

Fix a problem when creating a new kernel thread. In some cases, aio_read
or aio_write can return the pid of the new thread. This is due to the
way that return values from system calls being passed by side-effect in
the proc structure now. This commit fixes the problem with aio_read and
aio_write.


# 11783b14 01-Dec-1997 John Dyson <dyson@FreeBSD.org>

Fix error handling for VCHR type I/O. Also, fix another spl problem, and
remove alot of overly verbose debugging statements.
ioproclist {
int aioprocflags; /* AIO proc flags */
TAILQ_ENTRY(aioproclist) list; /* List of processes */
struct proc *aioproc; /* The AIO thread */
TAILQ_HEAD (,aiocblist) jobtorun; /* suggested job to run */
};

/*
* data-structure for lio signal management
*/
struct aio_liojob {
int lioj_flags;
int lioj_buffer_count;
int lioj_buffer_finished_count;
int lioj_queue_count;
int lioj_queue_finished_count;
struct sigevent lioj_signal; /* signal on all I/O done */
TAILQ_ENTRY (aio_liojob) lioj_list;
struct kaioinfo *lioj_ki;
};
#define LIOJ_SIGNAL 0x1 /* signal on all done (lio) */
#define LIOJ_SIGNAL_POSTED 0x2 /* signal has been posted */

/*
* per process aio data structure
*/
struct kaioinfo {
int kaio_flags; /* per process kaio flags */
int kaio_maxactive_count; /* maximum number of AIOs */
int kaio_active_count; /* number of currently used AIOs */
int kaio_qallowed_count; /* maxiumu size of AIO queue */
int kaio_queue_count; /* size of AIO queue */
int kaio_ballowed_count; /* maximum number of buffers */
int kaio_queue_finished_count; /* number of daemon jobs finished */
int kaio_buffer_count; /* number of physio buffers */
int kaio_buffer_finished_count; /* count of I/O done */
struct proc *kaio_p; /* process that uses this kaio block */
TAILQ_HEAD (,aio_liojob) kaio_liojoblist; /* list of lio jobs */
TAILQ_HEAD (,aiocblist) kaio_jobqueue; /* job queue for process */
TAILQ_HEAD (,aiocblist) kaio_jobdone; /* done queue for process */
TAILQ_HEAD (,aiocblist) kaio_bufqueue; /* buffer job queue for process */
TAILQ_HEAD (,aiocblist) kaio_bufdone; /* buffer done queue for process */
};

#define KAIO_RUNDOWN 0x1 /* process is being run down */
#define KAIO_WAKEUP 0x2 /* wakeup process when there is a significant
event */


TAILQ_HEAD (,aioproclist) aio_freeproc, aio_activeproc;
TAILQ_HEAD(,aiocblist) aio_jobs; /* Async job list */
TAILQ_HEAD(,aiocblist) aio_bufjobs; /* Phys I/O job list */
TAILQ_HEAD(,aiocblist) aio_freejobs; /* Pool of free jobs */

static void aio_init_aioinfo(struct proc *p) ;
static void aio_onceonly(void *) ;
static int aio_free_entry(struct aiocblist *aiocbe);
static void aio_process(struct aiocblist *aiocbe);
static int aio_newproc(void) ;
static int aio_aqueue(struct proc *p, struct aiocb *job, int type) ;
static void aio_physwakeup(struct buf *bp);
static int aio_fphysio(struct proc *p, struct aiocblist *aiocbe, int type);
static int aio_qphysio(struct proc *p, struct aiocblist *iocb);
static void aio_daemon(void *uproc);

SYSINIT(aio, SI_SUB_VFS, SI_ORDER_ANY, aio_onceonly, NULL);

static vm_zone_t kaio_zone=0, aiop_zone=0,
aiocb_zone=0, aiol_zone=0, aiolio_zone=0;

/*
* Single AIOD vmspace shared amongst all of them
*/
static struct vmspace *aiovmspace = NULL;

/*
* Startup initialization
*/
void
aio_onceonly(void *na)
{
TAILQ_INIT(&aio_freeproc);
TAILQ_INIT(&aio_activeproc);
TAILQ_INIT(&aio_jobs);
TAILQ_INIT(&aio_bufjobs);
TAILQ_INIT(&aio_freejobs);
kaio_zone = zinit("AIO", sizeof (struct kaioinfo), 0, 0, 1);
aiop_zone = zinit("AIOP", sizeof (struct aioproclist), 0, 0, 1);
aiocb_zone = zinit("AIOCB", sizeof (struct aiocblist), 0, 0, 1);
aiol_zone = zinit("AIOL", AIO_LISTIO_MAX * sizeof (int), 0, 0, 1);
aiolio_zone = zinit("AIOLIO",
AIO_LISTIO_MAX * sizeof (struct aio_liojob), 0, 0, 1);
aiod_timeout = AIOD_TIMEOUT_DEFAULT;
aiod_lifetime = AIOD_LIFETIME_DEFAULT;
jobrefid = 1;
}

/*
* Init the per-process aioinfo structure.
* The aioinfo limits are set per-process for user limit (resource) management.
*/
void
aio_init_aioinfo(struct proc *p)
{
struct kaioinfo *ki;
if (p->p_aioinfo == NULL) {
ki = zalloc(kaio_zone);
p->p_aioinfo = ki


# f4f0ecef 30-Nov-1997 John Dyson <dyson@FreeBSD.org>

Correct a last minute code change. Would have been an infinite loop under
certain error conditions.
Submitted by: pst@shockwave.com


# c5efdcbd 30-Nov-1997 John Dyson <dyson@FreeBSD.org>

Fix an spl nit.


# 84af4da6 29-Nov-1997 John Dyson <dyson@FreeBSD.org>

Finish up the vast majority of the AIO/LIO functionality. Proper signal
support was missing in the previous version of the AIO code. More
tunables added, and very efficient support for VCHR files has been added.
Kernel threads are not used for VCHR files, all work for such files is
done for the requesting process directly. Some attempt has been made to
charge the requesting process for resource utilization, but more work
is needed. aio_fsync is still missing (but the original fsync system
call can be used for now.) aio_cancel is essentially a noop, but that
is okay per POSIX. More aio_cancel functionality can be added later,
if it is found to be needed.

The functions implemented include:
aio_read, aio_write, lio_listio, aio_error, aio_return,
aio_cancel, aio_suspend.

The code has been implemented to support the POSIX spec 1003.1b
(formerly known as POSIX 1003.4 spec) features of the above. The
async I/O features are truly async, with the VCHR mode of operation
being essentially the same as physio (for appropriate files) for
maximum efficiency. This code also supports the signal capability,
is highly tunable, allowing management of resource usage, and
has been written to allow a per process usage quota.

Both the O'Reilly POSIX.4 book and the actual POSIX 1003.1b document
were the reference specs used. Any filedescriptor can be used with
these new system calls. I know of no exceptions where these
system calls will not work. (TTY's will also probably work.)


# f4feb04e 28-Nov-1997 John Dyson <dyson@FreeBSD.org>

Disable the VCHR optimization for AIO until I have implemented it. Just in
case anyone wants to play with the POSIX AIO/LIO stuff. (As it is, it should
work with ANY vnode, on UP systems only, for now.)


# fd3bf775 28-Nov-1997 John Dyson <dyson@FreeBSD.org>

Fix and complete the AIO syscalls. There are some performance enhancements
coming up soon, but the code is functional. Docs will be forthcoming.


# fdebd4f0 18-Nov-1997 Bruce Evans <bde@FreeBSD.org>

Get locking stuff by #including <sys/lock.h> instead of <sys/vnode.h>.


# 4a11ca4e 07-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Remove a bunch of variables which were unused both in GENERIC and LINT.

Found by: -Wunused


# cb226aaa 06-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
recompiled.


# a1c995b6 12-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 55166637 11-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Distribute and statizice a lot of the malloc M_* types.

Substantial input from: bde


# c4860686 10-Oct-1997 John Dyson <dyson@FreeBSD.org>

Make the target for the number of AIO daemons work.


# a624e84f 08-Oct-1997 John Dyson <dyson@FreeBSD.org>

Major cleanup and debugging of the new AIO/LIO code. The LIO code is
now corrected. New tunables/instrumentation added. The code is now
likely "good enough to use." I will add the userland support soon.
The "high performance" mode for raw devices is still missing, and will
be added next. POSIX system calls that now appear to work:
aio_cancel, aio_error, aio_read, aio_return, aio_suspend, aio_write,
lio_listio. Missing, but to be added soon: aio_fsync.


# e4ba6a82 02-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 5aaef07c 16-Jul-1997 John Dyson <dyson@FreeBSD.org>

Clean up some lint associated with the AIO code.


# 2244ea07 05-Jul-1997 John Dyson <dyson@FreeBSD.org>

This is an upgrade so that the kernel supports the AIO calls from
POSIX.4. Additionally, there is some initial code that supports LIO.
This code supports AIO/LIO for all types of file descriptors, with
few if any restrictions. There will be a followup very soon that
will support significantly more efficient operation for VCHR type
files (raw.) This code is also dependent on some kernel features
that don't work under SMP yet. After I commit the changes to the
kernel to support proper address space sharing on SMP, this code
will also work under SMP.


# ee877a35 15-Jun-1997 John Dyson <dyson@FreeBSD.org>

Add initial AIO/LIO kernel thread support files. This is preliminary, and
further features will be added.