History log of /openbsd-current/sys/sys/file.h
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
# 1.66 20-Jun-2022 visa

Remove unused struct fileops field fo_poll and callbacks.

OK mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.65 20-Jan-2022 jsg

initial support for drm sync files, fences associated with file
descriptors for explicit fencing

tested with libdrm's amdgpu_test syncobj timeline tests and vkcube on
intel broadwell with Mesa 21.3 (which hangs without sync file support
after the 'anv: Assume syncobj support' Mesa commit)

feedback and ok visa@


# 1.64 25-Oct-2021 claudio

Revert commitid: ufM9BcSbXqfLpzBH;
Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
In some cases it can result in a deadlock while suspending.
Discussed with mpi@ and deraadt@


# 1.63 21-Oct-2021 claudio

Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
vfs stalling is used by suspend/resume and by vmt(4) to stall any
filesystem operation from altering the state on disk. All these
operations will call vn_lock and be stalled. Adjust vfs_stall_barrier()
to allow the lock owner to still progress so that suspend can sync
the filesystems after stalling vfs operation.
OK mpi@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.62 02-Dec-2020 martijn

Hoist DTYPE_* out of #ifdef _KERNEL.
Similar to what NetBSD and FreeBSD have done.

OK guenther@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.61 13-Mar-2020 anton

In order to unlock flock(2), make writes to the f_iflags field of struct
file atomic. This also gets rid of the last kernel lock protected field
in the scope of struct file.

ok mpi@ visa@


# 1.60 01-Feb-2020 anton

Make writes to the f_flag field of `struct file' MP-safe using atomic
operations. Since the type of f_flag must change in order to use the
atomic(9) API, reorder the struct in order to avoid padding; as pointed
out by tedu@.

ok mpi@ visa@


# 1.59 05-Jan-2020 visa

Constify instances of struct fileops.

OK anton@, mpi@, bluhm@


Revision tags: OPENBSD_6_6_BASE
# 1.58 05-Aug-2019 anton

Allow concurrent reads of the f_offset field of struct file by
serializing both read/write operations using the existing file mutex.
The vnode lock still grants exclusive write access to the offset; the
mutex is only used to make the actual write atomic and prevent any
concurrent reader from observing intermediate values.

ok mpi@ visa@


# 1.57 12-Jul-2019 solene

Revert anton@ changes about read/write unlocking
https://marc.info/?l=openbsd-cvs&m=156277704122293&w=2

ok anton@


# 1.56 11-Jul-2019 anton

zero pad and align FO_POSITION; no binary change


# 1.55 10-Jul-2019 anton

Make read/write of the f_offset field belonging to struct file MP-safe;
as part of the effort to unlock the kernel. Instead of relying on the
vnode lock, introduce a dedicated lock per file. Exclusive write access
is granted using the new foffset_enter and foffset_leave API. A
convenience function foffset_get is also available for threads that only
need to read the current offset.

The lock acquisition order in vn_write has been changed to match the one
in vn_read in order to avoid a potential deadlock. This change also gets
rid of a documented race in vn_read().

Inspired by the FreeBSD implementation.

With help and ok mpi@ visa@


# 1.54 22-Jun-2019 semarie

push the KERNEL_LOCK deeper on read(2) and write(2)

unlocks read(2) and write(2) syscalls families, and push the KERNEL_LOCK
deeper in the code path. KERNEL_LOCK is managed per file type in fileops
handlers (fo_read, fo_write, and fo_close). read(2) and write(2) on
socket are KERNEL_LOCK-free.

initial work from mpi@ and ians@

ok mpi@ kettenis@ visa@ ians@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.53 20-Aug-2018 mpi

Reorder checks in the read/write(2) family of syscalls to prepare making
file operations mp-safe.

This change makes it clear that `f_offset' is only accessed in vn_read()
and vn_write(), which will help taking it out of the KERNEL_LOCK().

This refactoring uncovered a race in vn_read() which is now documented
and will be addressed in a later diff.

ok visa@


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.65 20-Jan-2022 jsg

initial support for drm sync files, fences associated with file
descriptors for explicit fencing

tested with libdrm's amdgpu_test syncobj timeline tests and vkcube on
intel broadwell with Mesa 21.3 (which hangs without sync file support
after the 'anv: Assume syncobj support' Mesa commit)

feedback and ok visa@


# 1.64 25-Oct-2021 claudio

Revert commitid: ufM9BcSbXqfLpzBH;
Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
In some cases it can result in a deadlock while suspending.
Discussed with mpi@ and deraadt@


# 1.63 21-Oct-2021 claudio

Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
vfs stalling is used by suspend/resume and by vmt(4) to stall any
filesystem operation from altering the state on disk. All these
operations will call vn_lock and be stalled. Adjust vfs_stall_barrier()
to allow the lock owner to still progress so that suspend can sync
the filesystems after stalling vfs operation.
OK mpi@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.62 02-Dec-2020 martijn

Hoist DTYPE_* out of #ifdef _KERNEL.
Similar to what NetBSD and FreeBSD have done.

OK guenther@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.61 13-Mar-2020 anton

In order to unlock flock(2), make writes to the f_iflags field of struct
file atomic. This also gets rid of the last kernel lock protected field
in the scope of struct file.

ok mpi@ visa@


# 1.60 01-Feb-2020 anton

Make writes to the f_flag field of `struct file' MP-safe using atomic
operations. Since the type of f_flag must change in order to use the
atomic(9) API, reorder the struct in order to avoid padding; as pointed
out by tedu@.

ok mpi@ visa@


# 1.59 05-Jan-2020 visa

Constify instances of struct fileops.

OK anton@, mpi@, bluhm@


Revision tags: OPENBSD_6_6_BASE
# 1.58 05-Aug-2019 anton

Allow concurrent reads of the f_offset field of struct file by
serializing both read/write operations using the existing file mutex.
The vnode lock still grants exclusive write access to the offset; the
mutex is only used to make the actual write atomic and prevent any
concurrent reader from observing intermediate values.

ok mpi@ visa@


# 1.57 12-Jul-2019 solene

Revert anton@ changes about read/write unlocking
https://marc.info/?l=openbsd-cvs&m=156277704122293&w=2

ok anton@


# 1.56 11-Jul-2019 anton

zero pad and align FO_POSITION; no binary change


# 1.55 10-Jul-2019 anton

Make read/write of the f_offset field belonging to struct file MP-safe;
as part of the effort to unlock the kernel. Instead of relying on the
vnode lock, introduce a dedicated lock per file. Exclusive write access
is granted using the new foffset_enter and foffset_leave API. A
convenience function foffset_get is also available for threads that only
need to read the current offset.

The lock acquisition order in vn_write has been changed to match the one
in vn_read in order to avoid a potential deadlock. This change also gets
rid of a documented race in vn_read().

Inspired by the FreeBSD implementation.

With help and ok mpi@ visa@


# 1.54 22-Jun-2019 semarie

push the KERNEL_LOCK deeper on read(2) and write(2)

unlocks read(2) and write(2) syscalls families, and push the KERNEL_LOCK
deeper in the code path. KERNEL_LOCK is managed per file type in fileops
handlers (fo_read, fo_write, and fo_close). read(2) and write(2) on
socket are KERNEL_LOCK-free.

initial work from mpi@ and ians@

ok mpi@ kettenis@ visa@ ians@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.53 20-Aug-2018 mpi

Reorder checks in the read/write(2) family of syscalls to prepare making
file operations mp-safe.

This change makes it clear that `f_offset' is only accessed in vn_read()
and vn_write(), which will help taking it out of the KERNEL_LOCK().

This refactoring uncovered a race in vn_read() which is now documented
and will be addressed in a later diff.

ok visa@


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.64 25-Oct-2021 claudio

Revert commitid: ufM9BcSbXqfLpzBH;
Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
In some cases it can result in a deadlock while suspending.
Discussed with mpi@ and deraadt@


# 1.63 21-Oct-2021 claudio

Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
vfs stalling is used by suspend/resume and by vmt(4) to stall any
filesystem operation from altering the state on disk. All these
operations will call vn_lock and be stalled. Adjust vfs_stall_barrier()
to allow the lock owner to still progress so that suspend can sync
the filesystems after stalling vfs operation.
OK mpi@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.62 02-Dec-2020 martijn

Hoist DTYPE_* out of #ifdef _KERNEL.
Similar to what NetBSD and FreeBSD have done.

OK guenther@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.61 13-Mar-2020 anton

In order to unlock flock(2), make writes to the f_iflags field of struct
file atomic. This also gets rid of the last kernel lock protected field
in the scope of struct file.

ok mpi@ visa@


# 1.60 01-Feb-2020 anton

Make writes to the f_flag field of `struct file' MP-safe using atomic
operations. Since the type of f_flag must change in order to use the
atomic(9) API, reorder the struct in order to avoid padding; as pointed
out by tedu@.

ok mpi@ visa@


# 1.59 05-Jan-2020 visa

Constify instances of struct fileops.

OK anton@, mpi@, bluhm@


Revision tags: OPENBSD_6_6_BASE
# 1.58 05-Aug-2019 anton

Allow concurrent reads of the f_offset field of struct file by
serializing both read/write operations using the existing file mutex.
The vnode lock still grants exclusive write access to the offset; the
mutex is only used to make the actual write atomic and prevent any
concurrent reader from observing intermediate values.

ok mpi@ visa@


# 1.57 12-Jul-2019 solene

Revert anton@ changes about read/write unlocking
https://marc.info/?l=openbsd-cvs&m=156277704122293&w=2

ok anton@


# 1.56 11-Jul-2019 anton

zero pad and align FO_POSITION; no binary change


# 1.55 10-Jul-2019 anton

Make read/write of the f_offset field belonging to struct file MP-safe;
as part of the effort to unlock the kernel. Instead of relying on the
vnode lock, introduce a dedicated lock per file. Exclusive write access
is granted using the new foffset_enter and foffset_leave API. A
convenience function foffset_get is also available for threads that only
need to read the current offset.

The lock acquisition order in vn_write has been changed to match the one
in vn_read in order to avoid a potential deadlock. This change also gets
rid of a documented race in vn_read().

Inspired by the FreeBSD implementation.

With help and ok mpi@ visa@


# 1.54 22-Jun-2019 semarie

push the KERNEL_LOCK deeper on read(2) and write(2)

unlocks read(2) and write(2) syscalls families, and push the KERNEL_LOCK
deeper in the code path. KERNEL_LOCK is managed per file type in fileops
handlers (fo_read, fo_write, and fo_close). read(2) and write(2) on
socket are KERNEL_LOCK-free.

initial work from mpi@ and ians@

ok mpi@ kettenis@ visa@ ians@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.53 20-Aug-2018 mpi

Reorder checks in the read/write(2) family of syscalls to prepare making
file operations mp-safe.

This change makes it clear that `f_offset' is only accessed in vn_read()
and vn_write(), which will help taking it out of the KERNEL_LOCK().

This refactoring uncovered a race in vn_read() which is now documented
and will be addressed in a later diff.

ok visa@


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.63 21-Oct-2021 claudio

Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
vfs stalling is used by suspend/resume and by vmt(4) to stall any
filesystem operation from altering the state on disk. All these
operations will call vn_lock and be stalled. Adjust vfs_stall_barrier()
to allow the lock owner to still progress so that suspend can sync
the filesystems after stalling vfs operation.
OK mpi@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.62 02-Dec-2020 martijn

Hoist DTYPE_* out of #ifdef _KERNEL.
Similar to what NetBSD and FreeBSD have done.

OK guenther@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.61 13-Mar-2020 anton

In order to unlock flock(2), make writes to the f_iflags field of struct
file atomic. This also gets rid of the last kernel lock protected field
in the scope of struct file.

ok mpi@ visa@


# 1.60 01-Feb-2020 anton

Make writes to the f_flag field of `struct file' MP-safe using atomic
operations. Since the type of f_flag must change in order to use the
atomic(9) API, reorder the struct in order to avoid padding; as pointed
out by tedu@.

ok mpi@ visa@


# 1.59 05-Jan-2020 visa

Constify instances of struct fileops.

OK anton@, mpi@, bluhm@


Revision tags: OPENBSD_6_6_BASE
# 1.58 05-Aug-2019 anton

Allow concurrent reads of the f_offset field of struct file by
serializing both read/write operations using the existing file mutex.
The vnode lock still grants exclusive write access to the offset; the
mutex is only used to make the actual write atomic and prevent any
concurrent reader from observing intermediate values.

ok mpi@ visa@


# 1.57 12-Jul-2019 solene

Revert anton@ changes about read/write unlocking
https://marc.info/?l=openbsd-cvs&m=156277704122293&w=2

ok anton@


# 1.56 11-Jul-2019 anton

zero pad and align FO_POSITION; no binary change


# 1.55 10-Jul-2019 anton

Make read/write of the f_offset field belonging to struct file MP-safe;
as part of the effort to unlock the kernel. Instead of relying on the
vnode lock, introduce a dedicated lock per file. Exclusive write access
is granted using the new foffset_enter and foffset_leave API. A
convenience function foffset_get is also available for threads that only
need to read the current offset.

The lock acquisition order in vn_write has been changed to match the one
in vn_read in order to avoid a potential deadlock. This change also gets
rid of a documented race in vn_read().

Inspired by the FreeBSD implementation.

With help and ok mpi@ visa@


# 1.54 22-Jun-2019 semarie

push the KERNEL_LOCK deeper on read(2) and write(2)

unlocks read(2) and write(2) syscalls families, and push the KERNEL_LOCK
deeper in the code path. KERNEL_LOCK is managed per file type in fileops
handlers (fo_read, fo_write, and fo_close). read(2) and write(2) on
socket are KERNEL_LOCK-free.

initial work from mpi@ and ians@

ok mpi@ kettenis@ visa@ ians@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.53 20-Aug-2018 mpi

Reorder checks in the read/write(2) family of syscalls to prepare making
file operations mp-safe.

This change makes it clear that `f_offset' is only accessed in vn_read()
and vn_write(), which will help taking it out of the KERNEL_LOCK().

This refactoring uncovered a race in vn_read() which is now documented
and will be addressed in a later diff.

ok visa@


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.62 02-Dec-2020 martijn

Hoist DTYPE_* out of #ifdef _KERNEL.
Similar to what NetBSD and FreeBSD have done.

OK guenther@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.61 13-Mar-2020 anton

In order to unlock flock(2), make writes to the f_iflags field of struct
file atomic. This also gets rid of the last kernel lock protected field
in the scope of struct file.

ok mpi@ visa@


# 1.60 01-Feb-2020 anton

Make writes to the f_flag field of `struct file' MP-safe using atomic
operations. Since the type of f_flag must change in order to use the
atomic(9) API, reorder the struct in order to avoid padding; as pointed
out by tedu@.

ok mpi@ visa@


# 1.59 05-Jan-2020 visa

Constify instances of struct fileops.

OK anton@, mpi@, bluhm@


Revision tags: OPENBSD_6_6_BASE
# 1.58 05-Aug-2019 anton

Allow concurrent reads of the f_offset field of struct file by
serializing both read/write operations using the existing file mutex.
The vnode lock still grants exclusive write access to the offset; the
mutex is only used to make the actual write atomic and prevent any
concurrent reader from observing intermediate values.

ok mpi@ visa@


# 1.57 12-Jul-2019 solene

Revert anton@ changes about read/write unlocking
https://marc.info/?l=openbsd-cvs&m=156277704122293&w=2

ok anton@


# 1.56 11-Jul-2019 anton

zero pad and align FO_POSITION; no binary change


# 1.55 10-Jul-2019 anton

Make read/write of the f_offset field belonging to struct file MP-safe;
as part of the effort to unlock the kernel. Instead of relying on the
vnode lock, introduce a dedicated lock per file. Exclusive write access
is granted using the new foffset_enter and foffset_leave API. A
convenience function foffset_get is also available for threads that only
need to read the current offset.

The lock acquisition order in vn_write has been changed to match the one
in vn_read in order to avoid a potential deadlock. This change also gets
rid of a documented race in vn_read().

Inspired by the FreeBSD implementation.

With help and ok mpi@ visa@


# 1.54 22-Jun-2019 semarie

push the KERNEL_LOCK deeper on read(2) and write(2)

unlocks read(2) and write(2) syscalls families, and push the KERNEL_LOCK
deeper in the code path. KERNEL_LOCK is managed per file type in fileops
handlers (fo_read, fo_write, and fo_close). read(2) and write(2) on
socket are KERNEL_LOCK-free.

initial work from mpi@ and ians@

ok mpi@ kettenis@ visa@ ians@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.53 20-Aug-2018 mpi

Reorder checks in the read/write(2) family of syscalls to prepare making
file operations mp-safe.

This change makes it clear that `f_offset' is only accessed in vn_read()
and vn_write(), which will help taking it out of the KERNEL_LOCK().

This refactoring uncovered a race in vn_read() which is now documented
and will be addressed in a later diff.

ok visa@


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.61 13-Mar-2020 anton

In order to unlock flock(2), make writes to the f_iflags field of struct
file atomic. This also gets rid of the last kernel lock protected field
in the scope of struct file.

ok mpi@ visa@


# 1.60 01-Feb-2020 anton

Make writes to the f_flag field of `struct file' MP-safe using atomic
operations. Since the type of f_flag must change in order to use the
atomic(9) API, reorder the struct in order to avoid padding; as pointed
out by tedu@.

ok mpi@ visa@


# 1.59 05-Jan-2020 visa

Constify instances of struct fileops.

OK anton@, mpi@, bluhm@


Revision tags: OPENBSD_6_6_BASE
# 1.58 05-Aug-2019 anton

Allow concurrent reads of the f_offset field of struct file by
serializing both read/write operations using the existing file mutex.
The vnode lock still grants exclusive write access to the offset; the
mutex is only used to make the actual write atomic and prevent any
concurrent reader from observing intermediate values.

ok mpi@ visa@


# 1.57 12-Jul-2019 solene

Revert anton@ changes about read/write unlocking
https://marc.info/?l=openbsd-cvs&m=156277704122293&w=2

ok anton@


# 1.56 11-Jul-2019 anton

zero pad and align FO_POSITION; no binary change


# 1.55 10-Jul-2019 anton

Make read/write of the f_offset field belonging to struct file MP-safe;
as part of the effort to unlock the kernel. Instead of relying on the
vnode lock, introduce a dedicated lock per file. Exclusive write access
is granted using the new foffset_enter and foffset_leave API. A
convenience function foffset_get is also available for threads that only
need to read the current offset.

The lock acquisition order in vn_write has been changed to match the one
in vn_read in order to avoid a potential deadlock. This change also gets
rid of a documented race in vn_read().

Inspired by the FreeBSD implementation.

With help and ok mpi@ visa@


# 1.54 22-Jun-2019 semarie

push the KERNEL_LOCK deeper on read(2) and write(2)

unlocks read(2) and write(2) syscalls families, and push the KERNEL_LOCK
deeper in the code path. KERNEL_LOCK is managed per file type in fileops
handlers (fo_read, fo_write, and fo_close). read(2) and write(2) on
socket are KERNEL_LOCK-free.

initial work from mpi@ and ians@

ok mpi@ kettenis@ visa@ ians@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.53 20-Aug-2018 mpi

Reorder checks in the read/write(2) family of syscalls to prepare making
file operations mp-safe.

This change makes it clear that `f_offset' is only accessed in vn_read()
and vn_write(), which will help taking it out of the KERNEL_LOCK().

This refactoring uncovered a race in vn_read() which is now documented
and will be addressed in a later diff.

ok visa@


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.60 01-Feb-2020 anton

Make writes to the f_flag field of `struct file' MP-safe using atomic
operations. Since the type of f_flag must change in order to use the
atomic(9) API, reorder the struct in order to avoid padding; as pointed
out by tedu@.

ok mpi@ visa@


# 1.59 05-Jan-2020 visa

Constify instances of struct fileops.

OK anton@, mpi@, bluhm@


Revision tags: OPENBSD_6_6_BASE
# 1.58 05-Aug-2019 anton

Allow concurrent reads of the f_offset field of struct file by
serializing both read/write operations using the existing file mutex.
The vnode lock still grants exclusive write access to the offset; the
mutex is only used to make the actual write atomic and prevent any
concurrent reader from observing intermediate values.

ok mpi@ visa@


# 1.57 12-Jul-2019 solene

Revert anton@ changes about read/write unlocking
https://marc.info/?l=openbsd-cvs&m=156277704122293&w=2

ok anton@


# 1.56 11-Jul-2019 anton

zero pad and align FO_POSITION; no binary change


# 1.55 10-Jul-2019 anton

Make read/write of the f_offset field belonging to struct file MP-safe;
as part of the effort to unlock the kernel. Instead of relying on the
vnode lock, introduce a dedicated lock per file. Exclusive write access
is granted using the new foffset_enter and foffset_leave API. A
convenience function foffset_get is also available for threads that only
need to read the current offset.

The lock acquisition order in vn_write has been changed to match the one
in vn_read in order to avoid a potential deadlock. This change also gets
rid of a documented race in vn_read().

Inspired by the FreeBSD implementation.

With help and ok mpi@ visa@


# 1.54 22-Jun-2019 semarie

push the KERNEL_LOCK deeper on read(2) and write(2)

unlocks read(2) and write(2) syscalls families, and push the KERNEL_LOCK
deeper in the code path. KERNEL_LOCK is managed per file type in fileops
handlers (fo_read, fo_write, and fo_close). read(2) and write(2) on
socket are KERNEL_LOCK-free.

initial work from mpi@ and ians@

ok mpi@ kettenis@ visa@ ians@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.53 20-Aug-2018 mpi

Reorder checks in the read/write(2) family of syscalls to prepare making
file operations mp-safe.

This change makes it clear that `f_offset' is only accessed in vn_read()
and vn_write(), which will help taking it out of the KERNEL_LOCK().

This refactoring uncovered a race in vn_read() which is now documented
and will be addressed in a later diff.

ok visa@


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.59 05-Jan-2020 visa

Constify instances of struct fileops.

OK anton@, mpi@, bluhm@


Revision tags: OPENBSD_6_6_BASE
# 1.58 05-Aug-2019 anton

Allow concurrent reads of the f_offset field of struct file by
serializing both read/write operations using the existing file mutex.
The vnode lock still grants exclusive write access to the offset; the
mutex is only used to make the actual write atomic and prevent any
concurrent reader from observing intermediate values.

ok mpi@ visa@


# 1.57 12-Jul-2019 solene

Revert anton@ changes about read/write unlocking
https://marc.info/?l=openbsd-cvs&m=156277704122293&w=2

ok anton@


# 1.56 11-Jul-2019 anton

zero pad and align FO_POSITION; no binary change


# 1.55 10-Jul-2019 anton

Make read/write of the f_offset field belonging to struct file MP-safe;
as part of the effort to unlock the kernel. Instead of relying on the
vnode lock, introduce a dedicated lock per file. Exclusive write access
is granted using the new foffset_enter and foffset_leave API. A
convenience function foffset_get is also available for threads that only
need to read the current offset.

The lock acquisition order in vn_write has been changed to match the one
in vn_read in order to avoid a potential deadlock. This change also gets
rid of a documented race in vn_read().

Inspired by the FreeBSD implementation.

With help and ok mpi@ visa@


# 1.54 22-Jun-2019 semarie

push the KERNEL_LOCK deeper on read(2) and write(2)

unlocks read(2) and write(2) syscalls families, and push the KERNEL_LOCK
deeper in the code path. KERNEL_LOCK is managed per file type in fileops
handlers (fo_read, fo_write, and fo_close). read(2) and write(2) on
socket are KERNEL_LOCK-free.

initial work from mpi@ and ians@

ok mpi@ kettenis@ visa@ ians@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.53 20-Aug-2018 mpi

Reorder checks in the read/write(2) family of syscalls to prepare making
file operations mp-safe.

This change makes it clear that `f_offset' is only accessed in vn_read()
and vn_write(), which will help taking it out of the KERNEL_LOCK().

This refactoring uncovered a race in vn_read() which is now documented
and will be addressed in a later diff.

ok visa@


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.58 05-Aug-2019 anton

Allow concurrent reads of the f_offset field of struct file by
serializing both read/write operations using the existing file mutex.
The vnode lock still grants exclusive write access to the offset; the
mutex is only used to make the actual write atomic and prevent any
concurrent reader from observing intermediate values.

ok mpi@ visa@


# 1.57 12-Jul-2019 solene

Revert anton@ changes about read/write unlocking
https://marc.info/?l=openbsd-cvs&m=156277704122293&w=2

ok anton@


# 1.56 11-Jul-2019 anton

zero pad and align FO_POSITION; no binary change


# 1.55 10-Jul-2019 anton

Make read/write of the f_offset field belonging to struct file MP-safe;
as part of the effort to unlock the kernel. Instead of relying on the
vnode lock, introduce a dedicated lock per file. Exclusive write access
is granted using the new foffset_enter and foffset_leave API. A
convenience function foffset_get is also available for threads that only
need to read the current offset.

The lock acquisition order in vn_write has been changed to match the one
in vn_read in order to avoid a potential deadlock. This change also gets
rid of a documented race in vn_read().

Inspired by the FreeBSD implementation.

With help and ok mpi@ visa@


# 1.54 22-Jun-2019 semarie

push the KERNEL_LOCK deeper on read(2) and write(2)

unlocks read(2) and write(2) syscalls families, and push the KERNEL_LOCK
deeper in the code path. KERNEL_LOCK is managed per file type in fileops
handlers (fo_read, fo_write, and fo_close). read(2) and write(2) on
socket are KERNEL_LOCK-free.

initial work from mpi@ and ians@

ok mpi@ kettenis@ visa@ ians@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.53 20-Aug-2018 mpi

Reorder checks in the read/write(2) family of syscalls to prepare making
file operations mp-safe.

This change makes it clear that `f_offset' is only accessed in vn_read()
and vn_write(), which will help taking it out of the KERNEL_LOCK().

This refactoring uncovered a race in vn_read() which is now documented
and will be addressed in a later diff.

ok visa@


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.57 12-Jul-2019 solene

Revert anton@ changes about read/write unlocking
https://marc.info/?l=openbsd-cvs&m=156277704122293&w=2

ok anton@


# 1.56 11-Jul-2019 anton

zero pad and align FO_POSITION; no binary change


# 1.55 10-Jul-2019 anton

Make read/write of the f_offset field belonging to struct file MP-safe;
as part of the effort to unlock the kernel. Instead of relying on the
vnode lock, introduce a dedicated lock per file. Exclusive write access
is granted using the new foffset_enter and foffset_leave API. A
convenience function foffset_get is also available for threads that only
need to read the current offset.

The lock acquisition order in vn_write has been changed to match the one
in vn_read in order to avoid a potential deadlock. This change also gets
rid of a documented race in vn_read().

Inspired by the FreeBSD implementation.

With help and ok mpi@ visa@


# 1.54 22-Jun-2019 semarie

push the KERNEL_LOCK deeper on read(2) and write(2)

unlocks read(2) and write(2) syscalls families, and push the KERNEL_LOCK
deeper in the code path. KERNEL_LOCK is managed per file type in fileops
handlers (fo_read, fo_write, and fo_close). read(2) and write(2) on
socket are KERNEL_LOCK-free.

initial work from mpi@ and ians@

ok mpi@ kettenis@ visa@ ians@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.53 20-Aug-2018 mpi

Reorder checks in the read/write(2) family of syscalls to prepare making
file operations mp-safe.

This change makes it clear that `f_offset' is only accessed in vn_read()
and vn_write(), which will help taking it out of the KERNEL_LOCK().

This refactoring uncovered a race in vn_read() which is now documented
and will be addressed in a later diff.

ok visa@


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.56 11-Jul-2019 anton

zero pad and align FO_POSITION; no binary change


# 1.55 10-Jul-2019 anton

Make read/write of the f_offset field belonging to struct file MP-safe;
as part of the effort to unlock the kernel. Instead of relying on the
vnode lock, introduce a dedicated lock per file. Exclusive write access
is granted using the new foffset_enter and foffset_leave API. A
convenience function foffset_get is also available for threads that only
need to read the current offset.

The lock acquisition order in vn_write has been changed to match the one
in vn_read in order to avoid a potential deadlock. This change also gets
rid of a documented race in vn_read().

Inspired by the FreeBSD implementation.

With help and ok mpi@ visa@


# 1.54 22-Jun-2019 semarie

push the KERNEL_LOCK deeper on read(2) and write(2)

unlocks read(2) and write(2) syscalls families, and push the KERNEL_LOCK
deeper in the code path. KERNEL_LOCK is managed per file type in fileops
handlers (fo_read, fo_write, and fo_close). read(2) and write(2) on
socket are KERNEL_LOCK-free.

initial work from mpi@ and ians@

ok mpi@ kettenis@ visa@ ians@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.53 20-Aug-2018 mpi

Reorder checks in the read/write(2) family of syscalls to prepare making
file operations mp-safe.

This change makes it clear that `f_offset' is only accessed in vn_read()
and vn_write(), which will help taking it out of the KERNEL_LOCK().

This refactoring uncovered a race in vn_read() which is now documented
and will be addressed in a later diff.

ok visa@


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.54 22-Jun-2019 semarie

push the KERNEL_LOCK deeper on read(2) and write(2)

unlocks read(2) and write(2) syscalls families, and push the KERNEL_LOCK
deeper in the code path. KERNEL_LOCK is managed per file type in fileops
handlers (fo_read, fo_write, and fo_close). read(2) and write(2) on
socket are KERNEL_LOCK-free.

initial work from mpi@ and ians@

ok mpi@ kettenis@ visa@ ians@


Revision tags: OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.53 20-Aug-2018 mpi

Reorder checks in the read/write(2) family of syscalls to prepare making
file operations mp-safe.

This change makes it clear that `f_offset' is only accessed in vn_read()
and vn_write(), which will help taking it out of the KERNEL_LOCK().

This refactoring uncovered a race in vn_read() which is now documented
and will be addressed in a later diff.

ok visa@


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.53 20-Aug-2018 mpi

Reorder checks in the read/write(2) family of syscalls to prepare making
file operations mp-safe.

This change makes it clear that `f_offset' is only accessed in vn_read()
and vn_write(), which will help taking it out of the KERNEL_LOCK().

This refactoring uncovered a race in vn_read() which is now documented
and will be addressed in a later diff.

ok visa@


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.52 03-Jul-2018 kettenis

Add a new so_seek member to "struct file" such that we can have seekable
files that aren't vnodes. Move the vnode-specific code into its own
function. Add an implementation for the "DMA buffers" that can be used
by DRI3/prime code to find out the size of the graphics buffer.
This implementation is very limited and only supports offset 0 and only
for SEEK_SET and SEEK_END. This doesn't really make sense; implementing
stat(2) would be a more obvious choice. But this is what Linux does.

ok guenther@, visa@


# 1.51 02-Jul-2018 visa

Update the file reference count field `f_count' using atomic operations
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.

OK mpi@


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.50 25-Jun-2018 kettenis

Implement DRI3/prime support. This allows graphics buffers to be passed
between processes using file descriptors. This provides an alternative to
eporting them with guesable 32-bit IDs. This implementation does not (yet)
allow sharing of graphics buffers between GPUs.

ok mpi@, visa@


# 1.49 20-Jun-2018 mpi

Unlock sendmsg(2) and sendto(2).

These syscalls can now be executed w/o the KERNEL_LOCK() depending on
the kind of socket.

The current solution uses a single global mutex to serialize access to,
and reference count, 'struct file'.

ok visa@, kettenis@


# 1.48 18-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup, take 3.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]' or the global linked list. This allows
us to simplifies a lot code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu Masson, visa@, guenther@ and art@

Previous version ok bluhm@, ok visa@, sthen@


# 1.47 05-Jun-2018 mpi

Revert introduction of fdinsert(), a sanitify check triggers when
closing a LARVAL file.

Found the hardway by sthen@.


# 1.46 02-Jun-2018 mpi

Put file descriptors on shared data structures when they are completely
setup.

LARVAL fd still exist, but they are no longer marked with a flag and no
longer reachable via `fd_ofiles[]'. This allows us to simplifies a lot
code grabbing new references to fds.

All of this is now possible because dup2(2) refuses to clone LARVAL fds.

Note that the `fdplock' could now be release in all open(2)-like syscalls,
just like it is done in accept(2).

With inputs from Mathieu -, visa@, guenther@ and art@

ok visa@, bluhm@


# 1.45 09-May-2018 mpi

Mark `f_ops' as immutable.

The only place where it was modified after initialization is a corner
case where the vnode of an open file is substitued by another one. Sine
the type of the file doesn't change, there's no need to overwrite `f_ops'.

While here proctect file counters with `f_mtx'.

ok bluhm@, visa@


# 1.44 08-May-2018 mpi

Do do include <sys/mount.h> because it breaks some userland programs
that define _KERNEL...


# 1.43 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.42 08-May-2018 mpi

Protect per-file counters and document which lock is used to protect
the other fields.

Once we no longer have any [k] (kernel lock) protections, we'll be
able to unlock almost all network related syscalls.

Inputs from and ok bluhm@, visa@


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.41 25-Apr-2018 mpi

Introduce fd_iterfile() a new helper function to iterate over `filehead'.

This turns `filehead' into a local variable, that will make it easier
to protect it.

ok visa@


Revision tags: OPENBSD_6_3_BASE
# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.40 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.39 02-Jan-2018 guenther

Don't #include fcntl.h when _KERNEL is defined.

inspired by FreeBSD r24131
ok millert@ sthen@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.38 23-Aug-2016 tedu

rename nfiles to numfiles to avoid shadowing and stretch out the name.
ok deraadt


Revision tags: OPENBSD_6_0_BASE
# 1.37 26-Apr-2016 deraadt

No good reason to retain comments about old DTYPE_CRYPTO or DTYPE_SYSTRACE
values.


# 1.36 25-Apr-2016 tedu

remove systrace remnants


Revision tags: OPENBSD_5_9_BASE
# 1.35 28-Aug-2015 guenther

Rework the UNIX domain socket garbage collector, including ideas from
{Free,Net}BSD
- when a socket is closed with fds in its input, defer closing them to
a task to avoid recursing. This eliminates the complicated extra
reference taking which had a 37 line(!) comment explanation
- move flags, counts, and links only needed for this from struct file to
struct unpcb
- document the flow of the mark/sweep collector

much help from claudio@ who made me explain the GC to him until we trusted it
ok claudio@ mpi@ deraadt@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.34 18-Nov-2014 mikeb

DTYPE_CRYPTO is not used anymore; ok guenther (a while ago)


# 1.33 18-Nov-2014 tedu

file.h doesn't need to include unistd.h


Revision tags: OPENBSD_5_6_BASE
# 1.32 10-Jul-2014 deraadt

struct ucred; for fstat _KERNEL block


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.31 05-Jun-2013 guenther

Move FHASLOCK from f_flag to f_iflags, freeing up a bit for passing
O_* flags and eliminating an XXX comment.

ok matthew@ deraadt@


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.30 01-May-2012 guenther

Eliminate the f_usecount ref count in struct file; instead of sleeping
at the top of closef() until all in-progress calls finish, just do the
advisory locking bits required of close() by POSIX and let whichever
thread has the last reference do the call to the file's fo_close()
method and the final cleanup.

lots of discussion with deraadt@ and others; worked out with and ok krw@


# 1.29 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.28 28-Jun-2011 thib

Rename FMARK to FIF_MARK and FDEFER to FIF_DEFER and
move those flags to f_iflags; This makes rooms in the
flag member of struct file for some goodies matthew@
as planned.

ok matthew@, deraadt@.


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.27 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.26 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.25 04-Jun-2009 blambert

Put readv/writev changes back in, as they no longer hang ckuethe's ntpd.

Special thanks to ckuethe's ntpd for noticing the problem.

ok deraadt@


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE
# 1.24 26-Mar-2006 mickey

do per file io accounting and show that in fstat as well; pedro@ marco@ ok


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE OPENBSD_3_9_BASE SMP_SYNC_A SMP_SYNC_B
# 1.23 23-Sep-2003 millert

Replace select backends with poll backends. selscan() and pollscan()
now call the poll backend. With this change we implement greater
poll(2) functionality instead of emulating it via the select backend.
Adapted from NetBSD and including some changes from FreeBSD.
Tested by many, deraadt@ OK


Revision tags: OPENBSD_3_4_BASE
# 1.22 06-Aug-2003 deraadt

must pre-def struct file before circular structs


# 1.21 01-Aug-2003 tedu

move fileops out of file, and make it pretty. ok deraadt@ millert@


# 1.20 18-Jul-2003 tedu

caddr_t -> void *. ok millert@ tdeval@


# 1.19 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.18 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.17 14-Mar-2002 millert

First round of __P removal in sys


# 1.16 08-Feb-2002 art

- Rename FILE_{,UN}USE to FREF and FRELE. USE is a bad verb and we don't have
the same semantics as NetBSD anyway, so it's good to avoid name collissions.
- Always fdremove before freeing the file, not the other way around.
- falloc FREFs the file.
- have FILE_SET_MATURE FRELE the file (It feels like a good ortogonality to
falloc FREFing the file).
- Use closef as much as possible instead of ffree in error paths of
falloc:ing functions. closef is much more careful with the fd and can
deal with the fd being forcibly closed by dup2. Also try to avoid
manually calling *fo_close when closef can do that for us (this makes
some error paths mroe complicated (sys_socketpair and sys_pipe), but
others become simpler (sys_open)).


# 1.15 05-Feb-2002 art

Add counting of temporary references to a struct file (as opposed to references
from fd tables and other long-lived objects). This is to avoid races between
using a file descriptor and having another process (with shared fd table)
close it. We use a separate refence count so that error values from close(2)
will be correctly returned to the caller of close(2).

The macros for those reference counts are FILE_USE(fp) and FILE_UNUSE(fp).

Make sure that the cases where closef can be called "incorrectly" (most notably
dup2(2)) are handled.

Right now only callers of closef (and {,p}read) use FILE_{,UN}USE correctly,
more fixes incoming soon.


Revision tags: UBC_BASE
# 1.14 31-Oct-2001 art

branches: 1.14.2;
Clarify some struct fields.


# 1.13 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
any anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.12 15-May-2001 deraadt

DTYPE_CRYPTO


# 1.11 14-May-2001 art

Add a fo_stat member to struct fileops. Used soon.
Also add a stat function for kqueue from FreeBSD.


Revision tags: OPENBSD_2_9_BASE
# 1.10 01-Mar-2001 provos

port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.9 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.8 24-May-2000 deraadt

move kernel prototypes using iovec to the right place


Revision tags: OPENBSD_2_7_BASE
# 1.7 20-Apr-2000 deraadt

p{read,write}{,v} from csapuntz, partial NetBSD origin I think


# 1.6 19-Apr-2000 csapuntz

Change struct file interface methods read and write to pass file offset in
and out.

Make pread/pwrite in netbsd & linux thread safe - which is the whole point
anyway.


Revision tags: SMP_BASE
# 1.5 01-Feb-2000 assar

branches: 1.5.2;
add declaration of `vnops'


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE OPENBSD_2_5_BASE OPENBSD_2_6_BASE kame_19991208
# 1.4 01-Mar-1998 deraadt

crank f_count/f_msgcount to long; when incrementing try to leave 2 slots
empty for unp_gc() in case of cross referenced sockets .


Revision tags: OPENBSD_2_0_BASE OPENBSD_2_1_BASE OPENBSD_2_2_BASE
# 1.3 27-Aug-1996 shawn

New fast pipe(2) from freebsd without fancy vm stuff.

The old pipes can be used with the "OLD_PIPE" config option.


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision